Tutorial
Differential Network Analysis
Tova Fuller, Steve Horvath
Correspondence: [email protected], [email protected]
Abstract
Here we illustrate differential network analysis by comparing the connectivity and module
structure of two networks based on the liver expression data of lean and heavy mice. This
unbiased method for comparing two phenotypically distinct subgroups of mouse samples
serves as a method for understanding the underlying differential gene co-expression network
topology giving rise to altered biological pathways.
This work is in press:
Tova Fuller, Anatole Ghazalpour, Jason Aten, Thomas A. Drake, Aldons J. Lusis, Steve
Horvath (2007) Weighted gene coexpression network analysis strategies applied to
mouse weight. Mamm Genome, in press.
The data are described in:
Anatole Ghazalpour, Sudheer Doss, Bin Zang, Susanna Wang, Eric E. Schadt, Thomas
A. Drake, Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network
Analysis to Characterize Genes Related to Mouse Weight. PLoS Genetics.
We provide the statistical code used for generating the weighted gene co-expression network
results. Thus, the reader should be able to reproduce all of our findings. This document also serves as
a tutorial to differential weighted gene co-expression network analysis. Some familiarity with
the R software is desirable but the document is fairly self-contained. This document and data
files can be found at the following webpage:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis
More material on weighted network analysis can be found here:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Method Description:
The data are described in the PLoS article cited above [1]. Please also refer to the citations
above and below for more information regarding weighted gene co-expression network
analysis (WGCNA).
Here we show how networks may be constructed from two phenotypically different
subgroups of samples from a prior WGCNA experiment on mice. We identify the 30 mice at
each extreme of the weight spectrum in the BxH data (the 30 leanest and the 30 heaviest) and
construct a weighted gene co-expression network from each group.
Network Construction:
In co-expression networks, network nodes correspond to genes and connection strengths are
determined by the pairwise correlations between expression profiles. In contrast to
unweighted networks, weighted networks use soft thresholding of the Pearson correlation
matrix for determining the connection strengths between two genes. Soft thresholding of the
Pearson correlation preserves the gene co-expression information and leads to weighted
co-expression networks that are highly robust with respect to the construction method [2].
The network construction algorithm is described in detail elsewhere [2]. Briefly, a gene
co-expression similarity measure (the absolute value of Pearson's product moment correlation) was
used to relate every pair of genes. An adjacency matrix was then
constructed using a `soft' power adjacency function $a_{ij} = \mathrm{power}(s_{ij}, \beta) \equiv |s_{ij}|^{\beta}$, where $s_{ij}$ is the
co-expression similarity and $a_{ij}$ is the resulting adjacency that measures the
connection strength. The power $\beta$ is chosen using the scale free topology criterion proposed
in Zhang and Horvath (2005). Briefly, the power was chosen such that the resulting network
exhibited approximate scale free topology and a high mean number of connections. The scale
free topology criterion led us to choose a power of $\beta = 6$ based on the preliminary network
built from the 8000 most varying genes. However, since we are using a weighted network as
opposed to an unweighted network, the biological findings are highly robust with respect to
the choice of this power [2].
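As a rough illustration of the soft-thresholding step, the following sketch builds a weighted adjacency matrix from a small simulated expression matrix. The data and the names `toyExpr`, `s`, `a` and `k` are made up for this example and are not part of the tutorial files; the connectivity here simply excludes each gene's self-connection, which is an assumption about convention rather than a statement about the tutorial's `SoftConnectivity` function.

```r
# Toy expression matrix: rows are samples, columns are genes (simulated values).
set.seed(1)
toyExpr <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5,
                  dimnames = list(NULL, paste0("gene", 1:5)))

beta <- 6                                              # soft-thresholding power
s <- abs(cor(toyExpr, use = "pairwise.complete.obs"))  # similarity s_ij
a <- s^beta                                            # adjacency a_ij = |s_ij|^beta

# Whole-network connectivity of each gene: sum of its adjacencies,
# leaving out the self-connection on the diagonal.
k <- rowSums(a) - diag(a)
print(round(a, 3))
print(round(k, 3))
```

Raising the correlations to the sixth power pushes weak correlations toward zero while keeping every gene pair in the network, which is the sense in which the threshold is "soft".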
Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure or more precisely a
measure of node dissimilarity based on the topological overlap matrix [2]. Specifically the
topological overlap matrix is given by
$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}$$
where $l_{ij} = \sum_u a_{iu} a_{uj}$ denotes the number of nodes to which both i and j are connected,
$k_i = \sum_u a_{iu}$ is the connectivity of node i, and u
indexes the nodes of the network. The topological overlap matrix (TOM) is given by Ω=[ωij].
ωij is a number between 0 and 1 and is symmetric (i.e, ωij= ωji). The rationale for considering
this similarity measure is that nodes that are part of highly integrated modules are expected to
have high topological overlap with their neighbors.
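The topological overlap formula can be checked by hand on a very small network. The sketch below uses a hypothetical helper `tom_similarity` (not one of the functions in NetworkFunctions.txt) and a made-up 3-node adjacency matrix; it assumes the diagonal of the adjacency matrix is set to zero before connectivities are computed.

```r
# Topological overlap for a weighted adjacency matrix with entries in [0, 1].
tom_similarity <- function(a) {
  diag(a) <- 0                        # ignore self-connections
  k <- rowSums(a)                     # connectivity k_i
  l <- a %*% a                        # l_ij = sum_u a_iu * a_uj
  kmin <- outer(k, k, pmin)           # min(k_i, k_j) for every pair
  omega <- (l + a) / (kmin + 1 - a)
  diag(omega) <- 1                    # each node overlaps fully with itself
  omega
}

aToy <- matrix(c(0.0, 0.8, 0.5,
                 0.8, 0.0, 0.2,
                 0.5, 0.2, 0.0), nrow = 3, byrow = TRUE)
omega <- tom_similarity(aToy)
print(omega)
# For nodes 1 and 2: (0.5 * 0.2 + 0.8) / (min(1.3, 1.0) + 1 - 0.8) = 0.9 / 1.2 = 0.75
```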
Network Module Identification.
Gene "modules" are groups of nodes that have high topological overlap. Module identification
was based on the topological overlap matrix Ω=[ωij] defined above. To use it in hierarchical
clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e., the
topological overlap based dissimilarity measure is defined by $d_{ij} = 1 - \omega_{ij}$). Based on the
dissimilarity matrix we can use hierarchical clustering to discriminate one module from
another. We used a dynamic cut-tree algorithm for automatically and precisely identifying
modules in the hierarchical clustering dendrogram (the details of the algorithm can be found at
http://www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut).
The algorithm takes into account an essential feature of cluster occurrence and makes use of
the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive
process of cluster decomposition and combination and the process is iterated until the number
of clusters becomes stable. No claim is made that our module construction method is optimal.
A comparison of different module construction methods is beyond the scope of this paper.
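To make the clustering step concrete, here is a minimal sketch on simulated data with two built-in groups of co-expressed genes. It uses base R's `cutree` with a fixed number of clusters as a simple stand-in for the dynamic cut-tree algorithm, so it illustrates only the general idea (average-linkage clustering of the TOM-based dissimilarity), not the module detection actually used in this tutorial. All object names are invented for the example.

```r
set.seed(42)
n <- 30                                   # number of simulated samples
f1 <- rnorm(n)                            # hidden factor driving genes 1-5
f2 <- rnorm(n)                            # hidden factor driving genes 6-10
expr <- cbind(sapply(1:5, function(i) f1 + rnorm(n, sd = 0.3)),
              sapply(1:5, function(i) f2 + rnorm(n, sd = 0.3)))

a <- abs(cor(expr))^6                     # soft-thresholded adjacency
diag(a) <- 0
k <- rowSums(a)
omega <- (a %*% a + a) / (outer(k, k, pmin) + 1 - a)   # topological overlap
diag(omega) <- 1

d <- 1 - omega                            # TOM-based dissimilarity
tree <- hclust(as.dist(d), method = "average")
modules <- cutree(tree, k = 2)            # static cut, for illustration only
print(modules)
```

A static cut needs the number of clusters (or a height) chosen in advance, which is one reason an adaptive branch-cutting method is attractive for real dendrograms.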

Network Comparison Measures:
For the ith gene, we denote the whole network connectivity by $k_1(i)$ and $k_2(i)$ in networks 1
and 2, respectively. To facilitate the comparison between the connectivity measures of each
network, we divide each gene connectivity by the maximum network connectivity, i.e.
$$K_1(i) = \frac{k_1(i)}{\max(k_1)} \quad \text{and} \quad K_2(i) = \frac{k_2(i)}{\max(k_2)}.$$
We use the following measure of differential
connectivity: $\mathrm{DiffK}(i) = K_1(i) - K_2(i)$, but other measures of differential connectivity could
also be considered. To measure differential gene expression between the lean and heavy mice,
we use the absolute value of the Student t-test statistic.
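A small worked example of the differential connectivity measure, using made-up connectivity values for three hypothetical genes:

```r
# Made-up whole-network connectivities in network 1 and network 2.
k1 <- c(geneA = 12, geneB = 3, geneC = 6)
k2 <- c(geneA = 2,  geneB = 8, geneC = 4)

K1 <- k1 / max(k1)    # 1.00, 0.25, 0.50
K2 <- k2 / max(k2)    # 0.25, 1.00, 0.50
DiffK <- K1 - K2      # 0.75, -0.75, 0.00
print(DiffK)
```

Here geneA is relatively more connected in network 1, geneB is relatively more connected in network 2, and geneC has the same normalized connectivity in both, even though its raw connectivities differ.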

Sector plots and permutation test:
Plotting DiffK versus the t-test statistic value for each gene gives a visual demonstration of
how difference in connectivity relates to a more traditional t-statistic describing difference in
expression between the two networks.
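The R script later in this tutorial divides this plot into eight sectors using the cutoffs |t| = 1.96 and |DiffK| = 0.4. As a compact sketch of that bookkeeping, the hypothetical helper `assign_sector` below (not part of the tutorial's function file) returns the sector number for each gene, and NA for genes in the central region that belongs to no sector.

```r
# Assign each gene to one of the eight sectors of the DiffK versus t-statistic plot.
assign_sector <- function(diffK, tstat, kCut = 0.4, tCut = 1.96) {
  sector <- rep(NA_integer_, length(diffK))
  sector[tstat >  tCut & diffK < -kCut] <- 1L
  sector[tstat >  tCut & diffK > -kCut & diffK < kCut] <- 2L
  sector[tstat >  tCut & diffK >  kCut] <- 3L
  sector[tstat > -tCut & tstat < tCut & diffK >  kCut] <- 4L
  sector[tstat < -tCut & diffK >  kCut] <- 5L
  sector[tstat < -tCut & diffK > -kCut & diffK < kCut] <- 6L
  sector[tstat < -tCut & diffK < -kCut] <- 7L
  sector[tstat > -tCut & tstat < tCut & diffK < -kCut] <- 8L
  sector
}

# One made-up point per sector, plus one point in the central region.
exampleDiffK <- c(-0.5, 0, 0.5, 0.5,  0.5,  0, -0.5, -0.5, 0)
exampleT     <- c( 3,   3, 3,   0,   -3,   -3, -3,    0,   0)
sectors <- assign_sector(exampleDiffK, exampleT)
print(sectors)
print(table(sectors))   # per-sector gene counts
```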
To determine whether the number of genes in each of these sectors was significant, I permuted high or
low weight status among our 60 mice in Networks 1 and 2, and reconstructed each network
accordingly. I obtained values DiffKperm and ttestperm as described above, using the
permuted high or low weight status.
This permutation process was repeated 1000 times (no.perm = 1000), each time noting the
number of genes found in each of the 8 sectors (indexed by m) as $n_m^{perm}$. Similarly, the
sector counts for the unpermuted, observed data were noted as $n_m^{obs}$. Then, I found a value $N_m$ for
each sector representing the number of permutations for which $n_m^{obs} \le n_m^{perm}$. The p-value for each
sector was found as
$$p_m = \frac{N_m + 1}{\mathrm{no.perm} + 1}.$$
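The p-value formula can be illustrated with a few lines of R. The function name `perm_pvalue` and the counts below are invented for this sketch; only the formula itself comes from the text.

```r
# Permutation p-value for one sector:
# (number of permutations with count >= observed count, plus 1) / (no.perm + 1)
perm_pvalue <- function(nObs, nPerm) {
  N <- sum(nObs <= nPerm)
  (N + 1) / (length(nPerm) + 1)
}

# Made-up example: observed count 40 and nine permuted counts.
# Two permuted counts (41 and 40) are at least as large as 40, so p = 3/10.
pExample <- perm_pvalue(40, c(3, 5, 41, 2, 0, 7, 40, 1, 4))
print(pExample)

# If no permuted count reaches the observed one, the p-value is 1/(no.perm + 1).
pSmallest <- perm_pvalue(100, rep(0, 999))
print(pSmallest)
```

Adding 1 to the numerator and the denominator keeps the estimated p-value strictly positive, so its smallest possible value is set by the number of permutations.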


Functional Analysis
Notable genes were analyzed for pathway enrichment using the Database for Annotation,
Visualization, and Integrated Discovery (DAVID) [3]
(http://david.niaid.nih.gov/david/ease.htm).
1. Ghazalpour, A., et al., Integrating genetic and network analysis to characterize genes
related to mouse weight. PLoS Genet, 2006. 2(8): p. e130.
2. Zhang, B. and S. Horvath, A general framework for weighted gene co-expression
network analysis. Stat Appl Genet Mol Biol, 2005. 4: p. Article17.
3. Dennis, G., Jr., et al., DAVID: Database for Annotation, Visualization, and Integrated
Discovery. Genome Biol, 2003. 4(5): p. P3.
Statistical References
To cite this tutorial or the statistical methods please use
1. Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression
Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4:
No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17
For the generalized topological overlap matrix as applied to unweighted networks see
2. Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications
in Gene Co-expression Networks. Proceedings Volume. Biocomp Conference 2006,
Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/.
For some additional theoretical insights consider
3. Horvath S, Dong J, Yip A (2006) The Relationship between Intramodular Connectivity
and Gene Significance. Proceedings Volume. Biocomp Conference 2006, Las Vegas.
Technical report at http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/
4. Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and
Other Network Concepts in Co-expression Networks. Submitted.
# Absolutely no warranty on the code. Please contact TF or SH with suggestions.
# Downloading the R software
# 1) Go to http://www.R-project.org, download R and install it on your computer
# After installing R, you need to install several additional R library packages:
# For example to install Hmisc, open R,
# go to menu "Packages\Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.
# To get this tutorial and data files, go to the following webpage
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis
# Download the zip file containing:
# 1) R function file: "NetworkFunctions.txt", which contains several R functions
#    needed for Network Analysis.
# Unzip all the files into the same directory.
## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# read in the R libraries
library(MASS)           # standard, no need to install
library(class)          # standard, no need to install
library(cluster)
library(sma)            # install it for the function plot.mat
library(impute)         # install it for imputing missing values
library(scatterplot3d)
# Note: alter the following file paths to point towards where your files are.
source("/Users/TovaFuller/Documents/HorvathLab2006/NetworkFunctions/NetworkFunctions.txt")
setwd("/Users/TovaFuller/Documents/HorvathLab2007/MouseProject2.0/DiffNetworkAnalysis/")
# The following 3421 probe sets were arrived at using the following steps:
# 1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes
dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules.xls",header=T)
names(dat0)
# this contains information on the genes
datSummary=dat0[,c(1:8,144:150)]
# the following data frame contains the gene expression data: columns are genes, rows are
# arrays (samples)
datExpr <- t(dat0[,9:143])
no.samples <- dim(datExpr)[[1]]
dimnames(datExpr)[[2]]=datSummary[,1]
dim(datExpr)
# We read in the clinical data
datClinicalTraits=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)
#Now we order the mice so that trait file and expression file agree
restrictMice=is.element(datClinicalTraits$MiceID,dimnames(datExpr)[[1]])
table(restrictMice)
datClinicalTraits=datClinicalTraits[restrictMice,]
orderMiceTraits=order(datClinicalTraits$MiceID)
orderMiceExpr=order(dimnames(datExpr)[[1]])
datClinicalTraits =datClinicalTraits[orderMiceTraits,]
datExpr =datExpr[orderMiceExpr,]
# From the following table, we verify that all 135 mice are in order
table(datClinicalTraits$MiceID==dimnames(datExpr)[[1]])
rm(dat0);collect_garbage()
BodyWeight=as.numeric(datClinicalTraits$WeightG)
# Now we find the 30 leanest and 30 heaviest mice. Network 1 will refer to modules defined by
# the 30 leanest mice, and Network 2 will refer to modules defined by the 30 heaviest mice.
rest1=rank(BodyWeight,ties.method="first")<=30
rest2=rank(-BodyWeight,ties.method="first")<=30
# We check to make sure there are 30 of each, with no overlap between the groups
table(rest1)
table(rest2)
table(rest1==T & rest2==T)
# FALSE
#   135
# We separate the expressions for the fat and lean groups:
datExpr1=datExpr[rest1,]
datExpr2=datExpr[rest2,]
# Now we find the whole network connectivity measures for each:
k1=SoftConnectivity(datExpr1,6)
k2=SoftConnectivity(datExpr2,6)
# We would like to normalize the connectivity measures. We do this by dividing by the
# maximum values.
K1=k1/max(k1)
K2=k2/max(k2)
# Now we find the difference between the two connectivity values.
DiffK=K1-K2 # Note that we did not take the absolute value here.
# Negative values of this difference imply that the normalized heavy connectivity (K2) is
# greater than K1, the normalized lean connectivity.
poolRest=as.logical(rest1+rest2)
factorLevels=rep(NA,length(poolRest))
factorLevels[rest1]="lowWeight" # trait 1 is low weight
factorLevels[rest2]="highWeight" # trait 2 is high weight
ttest=rep(NA, dim(datExpr)[[2]])
# Let's determine the t-statistic.
for (i in 1:dim(datExpr)[[2]]){
ttest[i]=t.test(datExpr[poolRest,i]~factorLevels[poolRest])$statistic
}
# Permuted Status
# Let's create a control DiffK with permuted fat/lean status. We choose from the same
# mice that were considered in the fat and lean groups. Note results from this following
# analysis will be different each time because we are sampling randomly.
temp1=sample(which(poolRest==T),30,replace=F)
Rest1=rep(F,length(poolRest))
Rest1[temp1]=T
table(Rest1)
Rest2=as.logical(poolRest-Rest1)
table(Rest2)
table(Rest1==T & Rest2==T)
# FALSE
#   135 good!
factorLevels2=rep(NA,length(poolRest))
factorLevels2[Rest1]="permLowWeight"
factorLevels2[Rest2]="permHighWeight"
datExpr1.2=datExpr[Rest1,]
datExpr2.2=datExpr[Rest2,]
# Now we find the whole network connectivity measures for each:
k1.2=SoftConnectivity(datExpr1.2,6)
k2.2=SoftConnectivity(datExpr2.2,6)
# We would like to normalize the permuted connectivity measures. We do this by
# dividing by the maximum values.
K1.2=k1.2/max(k1.2)
K2.2=k2.2/max(k2.2)
# Now we find the difference between the two connectivity values.
DiffK.2=K1.2-K2.2 # Note that we did not take the absolute value here.
ttest2=rep(NA, dim(datExpr)[[2]])
# Let's determine the t-statistic.
for (i in 1:dim(datExpr)[[2]]){
ttest2[i]=t.test(datExpr[poolRest,i]~factorLevels2[poolRest])$statistic
}
# Plotting DiffK versus ttest in Unpermuted and Permuted
par(mfrow=c(1,2))
plot(DiffK,ttest, main=paste("Unpermuted: cor=", signif(cor(DiffK,ttest,
use="pairwise.complete.obs") ,3)),xlim=range(DiffK),ylim=range(ttest))
# colorGroup1 ??
abline(h=1.96)
abline(h=-1.96)
abline(v=.4)
abline(v=-.4)
plot(DiffK.2,ttest2, main=paste("Permuted: cor=",
signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)),
xlim=range(DiffK),ylim=range(ttest))
# colorGroup1 ??
abline(h=1.96)
abline(h=-1.96)
abline(v=.4)
abline(v=-.4)
# [Figure: the DiffK versus t-statistic plane divided into eight sectors, numbered 1 to 8]
# We repeat this permutation 1000 times, each time
# counting the number of "dots" in each of the sectors
# shown in the figure.
# Now we wish to obtain a p-value for the DiffK
# when there is permutation versus when there is
# not.
# Here are the sector counts for unpermuted:
sector1obs = (ttest>1.96) & (DiffK<(-0.4))
n1obs=sum(sector1obs)
sector2obs = (ttest>1.96) &(DiffK>(-0.4)) & (DiffK<0.4)
n2obs=sum(sector2obs)
sector3obs = (ttest>1.96) & (DiffK>0.4)
n3obs=sum(sector3obs)
sector4obs = (ttest>(-1.96)) & (ttest<1.96) & (DiffK>0.4)
n4obs=sum(sector4obs)
sector5obs = (ttest<(-1.96)) & (DiffK>0.4)
n5obs=sum(sector5obs)
sector6obs = (ttest<(-1.96)) &(DiffK>(-0.4)) & (DiffK<0.4)
n6obs=sum(sector6obs)
sector7obs = (ttest<(-1.96)) & (DiffK<(-0.4))
n7obs=sum(sector7obs)
sector8obs = (ttest>(-1.96)) & (ttest<1.96) & (DiffK<(-0.4))
n8obs=sum(sector8obs)
# We will find a vector of these same counts for each permutation, but the niobs value
# will always stay the same (scalar)
no.perms=1000
# We started with 100, and then redid the experiment with no.perms=1000
n1perm=rep(NA,no.perms)
n2perm=rep(NA,no.perms)
n3perm=rep(NA,no.perms)
n4perm=rep(NA,no.perms)
n5perm=rep(NA,no.perms)
n6perm=rep(NA,no.perms)
n7perm=rep(NA,no.perms)
n8perm=rep(NA,no.perms)
# We turn now to obtaining sector counts for each permutation
for (i in c(1:no.perms)) {
temp1=sample(which(poolRest==T),30,replace=F)
Rest1=rep(F,length(poolRest))
Rest1[temp1]=T
Rest2=as.logical(poolRest-Rest1)
factorLevels2=rep(NA,length(poolRest))
factorLevels2[Rest1]="fakefat"
factorLevels2[Rest2]="fakelean"
datExprFat.2=datExpr[Rest1,]
datExprLean.2=datExpr[Rest2,]
k1.2=SoftConnectivity(datExprFat.2,6)
k2.2=SoftConnectivity(datExprLean.2,6)
K1.2=k1.2/max(k1.2)
K2.2=k2.2/max(k2.2)
DiffK.2=K1.2-K2.2 # Note that we did not take the absolute value here.
ttest2=rep(NA, dim(datExpr)[[2]])
# Let's determine the t-statistic.
for (j in 1:dim(datExpr)[[2]]){
ttest2[j]=t.test(datExpr[poolRest,j]~factorLevels2[poolRest])$statistic
}
sector1perm = (ttest2>1.96) & (DiffK.2<(-0.4))
n1perm[i]=sum(sector1perm)
sector2perm = (ttest2>1.96) &(DiffK.2>(-0.4)) & (DiffK.2<0.4)
n2perm[i]=sum(sector2perm)
sector3perm = (ttest2>1.96) & (DiffK.2>0.4)
n3perm[i]=sum(sector3perm)
sector4perm = (ttest2>(-1.96)) & (ttest2<1.96) & (DiffK.2>0.4)
n4perm[i]=sum(sector4perm)
sector5perm = (ttest2<(-1.96)) & (DiffK.2>0.4)
n5perm[i]=sum(sector5perm)
sector6perm = (ttest2<(-1.96)) &(DiffK.2>(-0.4)) & (DiffK.2<0.4)
n6perm[i]=sum(sector6perm)
sector7perm = (ttest2<(-1.96)) & (DiffK.2<(-0.4))
n7perm[i]=sum(sector7perm)
sector8perm = (ttest2>(-1.96)) & (ttest2<1.96) & (DiffK.2<(-0.4))
n8perm[i]=sum(sector8perm)
}
logicalSum1=sum(n1obs<=n1perm)
logicalSum2=sum(n2obs<=n2perm)
logicalSum3=sum(n3obs<=n3perm)
logicalSum4=sum(n4obs<=n4perm)
logicalSum5=sum(n5obs<=n5perm)
logicalSum6=sum(n6obs<=n6perm)
logicalSum7=sum(n7obs<=n7perm)
logicalSum8=sum(n8obs<=n8perm)
pval1=(logicalSum1+1)/(no.perms+1)
pval1
pval2=(logicalSum2+1)/(no.perms+1)
pval2
pval3=(logicalSum3+1)/(no.perms+1)
pval3
pval4=(logicalSum4+1)/(no.perms+1)
pval4
pval5=(logicalSum5+1)/(no.perms+1)
pval5
pval6=(logicalSum6+1)/(no.perms+1)
pval6
pval7=(logicalSum7+1)/(no.perms+1)
pval7
pval8=(logicalSum8+1)/(no.perms+1)
pval8
# For no.perms=100:
# pval1
# [1] 0.5940594
# pval2
# [1] 0.00990099
# pval3
# [1] 0.00990099
# pval4
# [1] 0.930693
# pval5
# [1] 0.00990099
# pval6
# [1] 0.00990099
# pval7
# [1] 0.4950495
# pval8
# [1] 0.8118812
# For no.perms=1000:
# pval1
# [1] 0.4645355
# pval2
# [1] 0.000999001
# pval3
# [1] 0.000999001
# pval4
# [1] 0.947053
# pval5
# [1] 0.00999001
# pval6
# [1] 0.000999001
# pval7
# [1] 0.3346653
# pval8
# [1] 0.7972028
# Because the smallest attainable p-value is 1/(no.perms+1), the most significant p-values
# differ roughly 10-fold between 100 and 1000 iterations.
# The result we achieve makes sense; regions with significantly high or low ttest values
# (p < 0.05) tend to be significant. However, since there are few data points that have a
# DiffK value < -0.4, the upper and lower left hand corner sectors do not prove
# significant. This could be because DiffK is defined as "normalized low weight
# connectivity" minus "normalized high weight connectivity" and there are few genes that in
# fact are more connected in the fat than in the lean that also have extreme ttest scores.
# Module Identification
beta1=6
# Note: we use a beta1 = 6 as this was the value used in our previous analysis
# Finding hierGTOM in our Network 1 (low weight).
datExpr1=data.frame(datExpr1)
AdjMat1=abs(cor(datExpr1,use="p"))^beta1
collect_garbage()
dissGTOM1=TOMdist1(AdjMat1)
collect_garbage()
hierGTOM1 <- hclust(as.dist(dissGTOM1),method="average")
par(mfrow=c(1,1))
plot(hierGTOM1,labels=F)
rm(AdjMat1)
# Finding hierGTOM in our Network 2 (high weight).
datExpr2=data.frame(datExpr2)
AdjMat2=abs(cor(datExpr2,use="p"))^beta1
collect_garbage()
dissGTOM2=TOMdist1(AdjMat2)
collect_garbage()
hierGTOM2 <- hclust(as.dist(dissGTOM2),method="average")
par(mfrow=c(1,1))
plot(hierGTOM2,labels=F)
rm(AdjMat2)
# Dynamic Cut-Tree Algorithm
# "We used a dynamic cut-tree algorithm for selecting branches of the hierarchical
# clustering dendrogram (the details of the algorithm can be found at the following link:
# www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut). The algorithm takes
# into account an essential feature of cluster occurrence and makes use of the internal
# structure in a dendrogram. Specifically, the algorithm is based on an adaptive process
# of cluster decomposition and combination and the process is iterated until the number
# of clusters becomes stable." (From first mouse tutorial)
myheightcutoff =0.98
mydeepSplit = FALSE # fine structure within module
myminModuleSize = 90 # modules must have this minimum number of genes
# Note that we get a "too many modules" warning if we have a smaller minimum size, or
# higher height cutoff.
# new way for identifying modules based on hierarchical clustering dendrogram
color1=cutreeDynamic(hierclust=hierGTOM1, deepSplit=mydeepSplit,
maxTreeHeight=myheightcutoff, minModuleSize=myminModuleSize)
table(color1)
# color1 (module sizes)
#   turquoise      355
#   blue           336
#   brown          271
#   yellow         255
#   green          241
#   red            224
#   black          218
#   pink           184
#   magenta        155
#   purple         139
#   greenyellow    126
#   tan            117
#   salmon         113
#   cyan           110
#   midnightblue    92
#   grey           485
pdf(file="modules1.pdf")
par(mfrow=c(2,1))
plot(hierGTOM1, main="BXH Mouse Network 1",labels=F,xlab="",sub="");
hclustplot1(hierGTOM1,color1,title1="Colored by Network 1 Modules")
dev.off()
# Now we define modules in the Network 2.
color2=cutreeDynamic(hierclust=hierGTOM2, deepSplit=mydeepSplit,
maxTreeHeight=myheightcutoff, minModuleSize=myminModuleSize)
table(color2)
# color2 (module sizes)
#   turquoise      337
#   blue           243
#   brown          234
#   yellow         210
#   green          184
#   red            184
#   black          175
#   pink           172
#   magenta        161
#   purple         116
#   greenyellow    108
#   tan            102
#   salmon         102
#   cyan           101
#   midnightblue   100
#   lightcyan       98
#   grey60          92
#   lightgreen      90
#   grey           612
pdf(file="modules2.pdf")
par(mfrow=c(2,1))
plot(hierGTOM2, main="BXH Mouse Network 2",labels=F,xlab="",sub="");
hclustplot1(hierGTOM2,color2,title1="Colored by Network 2 Modules")
dev.off()
# We also want to look at modules in Network 2 colored by Network 1:
pdf(file="modules2by1.pdf")
par(mfrow=c(2,1))
plot(hierGTOM2, main="BXH Mouse Network 2",labels=F,xlab="",sub="");
hclustplot1(hierGTOM2,color1,title1="Colored by Network 1 Modules")
dev.off()
# We look at DiffK versus ttest in different modules.
par(mfrow=c(1,2))
plot(DiffK,ttest,col=as.character(color1),main="Colored by Network 1")
plot(DiffK,ttest,col=as.character(color2),main="Colored by Network 2")
# Here we create a Network 1 plot that is colored by module definitions, with sectors
# delineated.
par(mfrow=c(1,1))
plot(DiffK,ttest,col=as.character(color1),main=paste("Colored by Network 1 Modules, cor=",
signif(cor(DiffK,ttest,use="pairwise.complete.obs"),3)),
xlim=range(DiffK),ylim=range(ttest),xlab="Difference in Connectivity",ylab="t-test statistic")
abline(h=1.96)
abline(h=-1.96)
abline(v=.4)
abline(v=-.4)
# Sector labels; sector 7 is the lower left sector, so its label is placed at y = -9
text(c(-0.5,-0.2,0.7,0.7,0.7,-0.2,-0.5,-0.5),c(7,7,7,0,-9,-9,-9,0),labels=c("1","2","3","4","5","6","7","8"),cex=3)
# Here we create the corresponding plot for the permuted data.
par(mfrow=c(1,1))
plot(DiffK.2,ttest2, main=paste("Permuted, cor=",
signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)),
xlim=range(DiffK),ylim=range(ttest),xlab="Difference in Connectivity",ylab="t-test statistic")
abline(h=1.96)
abline(h=-1.96)
abline(v=.4)
abline(v=-.4)
# Here we create the permuted plot that appears in the Mammalian Genome article:
par(mfrow=c(1,2),cex.lab=1.3,mar=c(5,5,4,2)+0.1)
plot(DiffK,ttest,col=as.character(color1),main=paste("Colored by Network 1 Modules, cor=",
signif(cor(DiffK,ttest,use="pairwise.complete.obs"),3)),
xlim=range(DiffK),ylim=range(ttest),xlab="Difference in Connectivity",ylab="t-test statistic")
abline(h=1.96)
abline(h=-1.96)
abline(v=.4)
abline(v=-.4)
# Sector labels; sector 7 is the lower left sector, so its label is placed at y = -9
text(c(-0.5,-0.2,0.7,0.7,0.7,-0.2,-0.5,-0.5),c(7,7,7,0,-9,-9,-9,0),labels=c("1","2","3","4","5","6","7","8"),cex=3)
plot(DiffK.2,ttest2, main=paste("Permuted, cor=",
signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)),
xlim=range(DiffK),ylim=range(ttest),xlab="Difference in Connectivity",ylab="t-test statistic")
abline(h=1.96)
abline(h=-1.96)
abline(v=.4)
abline(v=-.4)
# FUNCTIONAL ANALYSIS OF HIGHLY DIFFERENTIAL MODULES
The yellow module in lean mice and the grey module in fat mice are particularly interesting
because they overlap with sector 3, which is found to be significantly different in DiffK versus
ttest at the p=9.9e-4 level. Furthermore, high connectivity in low weight mice alongside
low connectivity in high weight mice suggests that network structure either affects or is
affected by mouse weight. We use an EASE analysis of these genes, looking at:
1. yellow genes as defined by network 1 (lean) modules
2. the intersection of yellow genes as defined by network 1 modules and grey genes as
defined by network 2 modules
3. #1, limited by sector 3 boundaries
4. #2, limited by sector 3 boundaries.
# Let's find the genes we are interested in:
# Number 1 (see above)
Yellow1=color1=="yellow"
table(Yellow1)
# Yellow1
# FALSE  TRUE
#  3166   255
Grey2=color2=="grey"
table(Grey2)
# Grey2
# FALSE  TRUE
#  2809   612
# Number 2
Y1G2=Yellow1 & Grey2
table(Y1G2)
# Y1G2
# FALSE  TRUE
#  3250   171
# Number 3
Y1S3= ((ttest>1.96) & (DiffK>0.4) & Yellow1)
table(Y1S3)
# Y1S3
# FALSE  TRUE
#  3342    79
# Number 4
Y1G2S3=ttest>1.96 & (DiffK>0.4) & Y1G2
table(Y1G2S3)
# Y1G2S3
# FALSE  TRUE
#  3360    61
# dimnames(datExpr)[[2]] are the gene names. We can use the "locus link
# IDs", otherwise known as Entrez Gene IDs in the DAVID database to get a functional
# annotation.
EntrezGeneIDs=datSummary[,3]
# Background
write.table(EntrezGeneIDs,"background.txt",sep=" ",row.names=F,col.names=F)
# Number 1
write.table(datSummary[,3][Yellow1],"Yellow1.txt",sep=" ",row.names=F,col.names=F)
# Number 2
write.table(datSummary[,3][Y1G2],"Y1G2.txt",sep=" ",row.names=F,col.names=F)
# Number 3
write.table(datSummary[,3][Y1S3],"Y1S3.txt",sep=" ",row.names=F,col.names=F)
# Number 4
write.table(datSummary[,3][Y1G2S3],"Y1G2S3.txt",sep=" ",row.names=F,col.names=F)
Because a 0 is placed in lieu of a missing ID, I manually opened these files and deleted "0"
entries. Then I used the entire list of the 3421 most connected genes as a
background and entered this information into DAVID for a functional analysis.
In the yellow lean gene list, 250 of the 255 yellow lean genes were recognized as Mus
musculus genes. Of the 171 yellow lean / grey fat genes, 168 were recognized. All of the sector 3
genes, whether only yellow lean genes or the intersection of yellow lean and grey fat genes, were
recognized.
The smallest group of genes, those that met criterion 4, were highly specific for extracellular
matrix, extracellular region, epidermal growth factor and its related proteins, cell adhesion,
and glycoproteins. The table that follows lists the functional annotation terms for these genes, sorted by increasing
p-value. In the original document, entries with p-value < 0.05 are shaded.
In criterion 4 we added the additional requirement of being a grey fat gene. The genes that do
not meet this requirement (those that only meet criterion 3, being yellow genes in sector 3) are involved in
blood vessel morphogenesis, metalloprotease inhibition, anion transport, or other similar
functions seen in criterion 4.
CRITERIA 4 GENE FUNCTIONAL ANNOTATION: yellow in network 1, grey in network 2, sector 3
Category              Term                                                  Count  %       PValue
GOTERM_CC_ALL         extracellular region                                  23     37.70%  1.75E-04
SP_PIR_KEYWORDS       signal                                                22     36.07%  3.55E-04
UP_SEQ_FEATURE        signal peptide                                        22     36.07%  5.39E-04
GOTERM_CC_ALL         extracellular space                                   21     34.43%  5.69E-04
GOTERM_BP_ALL         cell adhesion                                         10     16.39%  7.69E-04
UP_SEQ_FEATURE        domain:EGF-like 1                                     5      8.20%   8.66E-04
UP_SEQ_FEATURE        glycosylation site:N-linked (GlcNAc...)               21     34.43%  0.00123567
SP_PIR_KEYWORDS       glycoprotein                                          21     34.43%  0.001564983
UP_SEQ_FEATURE        domain:EGF-like 3                                     4      6.56%   0.001605736
SP_PIR_KEYWORDS       cell adhesion                                         7      11.48%  0.001704503
SP_PIR_KEYWORDS       collagen                                              5      8.20%   0.001819918
GOTERM_BP_ALL         cell-cell adhesion                                    5      8.20%   0.00376904
UP_SEQ_FEATURE        domain:EGF-like 2                                     4      6.56%   0.006019022
SP_PIR_KEYWORDS       immunoglobulin domain                                 7      11.48%  0.006493147
GOTERM_MF_ALL         extracellular matrix structural constituent           4      6.56%   0.009692797
INTERPRO_NAME         IPR007110:Immunoglobulin-like                         7      11.48%  0.010018458
UP_SEQ_FEATURE        disulfide bond                                        16     26.23%  0.011240382
SMART_NAME            SM00181:EGF                                           5      8.20%   0.012956896
SP_PIR_KEYWORDS       extracellular matrix                                  5      8.20%   0.014607793
SMART_NAME            SM00179:EGF_CA                                        4      6.56%   0.014804082
SP_PIR_KEYWORDS       egf-like domain                                       5      8.20%   0.016893234
GOTERM_MF_ALL         structural molecule activity                          7      11.48%  0.017782941
INTERPRO_NAME         IPR000742:EGF-like, type 3                            5      8.20%   0.018214759
INTERPRO_NAME         IPR006210:EGF                                         5      8.20%   0.02067224
INTERPRO_NAME         IPR001881:EGF-like calcium-binding                    4      6.56%   0.02409503
INTERPRO_NAME         IPR006209:EGF-like                                    5      8.20%   0.029234871
INTERPRO_NAME         IPR013032:EGF-like region                             5      8.20%   0.029234871
GOTERM_BP_ALL         regulation of signal transduction                     4      6.56%   0.033918607
INTERPRO_NAME         IPR013106:Immunoglobulin V-set                        4      6.56%   0.043000016
INTERPRO_NAME         IPR013111:EGF, extracellular                          3      4.92%   0.043093768
GOTERM_BP_ALL         negative regulation of signal transduction            3      4.92%   0.044205578
GOTERM_CC_ALL         extracellular matrix (sensu Metazoa)                  5      8.20%   0.0465188
GOTERM_CC_ALL         extracellular matrix                                  5      8.20%   0.0465188
SMART_NAME            SM00643:C345C                                         2      3.28%   0.060483805
UP_SEQ_FEATURE        domain:VWFC                                           2      3.28%   0.067182734
INTERPRO_NAME         IPR013091:EGF calcium-binding                         3      4.92%   0.074400865
INTERPRO_NAME         IPR008160:Collagen triple helix repeat                3      4.92%   0.0801662
SP_PIR_KEYWORDS       calcium binding                                       3      4.92%   0.082536708
GOTERM_BP_ALL         phosphate transport                                   3      4.92%   0.088210508
UP_SEQ_FEATURE        domain:TSP N-terminal                                 2      3.28%   0.088585807
SP_PIR_KEYWORDS       duplication                                           3      4.92%   0.088868263
PIR_SUPERFAMILY_NAME  SF005770:regulator of G protein signaling, RGS4 type  2      3.28%   0.089493679
GOTERM_BP_ALL         organ morphogenesis                                   5      8.20%   0.094055231
INTERPRO_NAME         IPR001134:Netrin, C-terminal                          2      3.28%   0.095041582
write.table(datSummary[Y1G2S3,],"Y1G2S3summary.csv",sep=",")