Weighted gene co-expression network analysis (WGCNA) and network edge orienting (NEO)
Part I: WGCNA
Part II: NEO
Bin Zhang and Steve Horvath
University of California, Los Angeles, USA
Departments of Human Genetics and Biostatistics
Challenges of Modern Genetics
1. Genetic analysis of complex diseases is difficult
• Requires searching for many small-effect genes
• Difficult to detect signal at the DNA level
• RNA-level data may identify clusters of genes
2. Microarray technology measures RNA levels (gene expression)
• But these data are noisy!
• Focusing on single genes can lead to spurious results due to outliers or array artifacts
→ Network analysis of RNA data: “Gene Co-expression Network Analysis” (GCNA)
Scale-free Networks:
Derek J de Solla Price
• Derek J. de Solla Price taught applied mathematics at Raffles College, which became part of the University of Malaya in 1949 (a forerunner of the National University of Singapore).
• Singapore = great location for a systems
biology conference!
• In 1965 he published the first example
of a scale-free network.
• The network of scientific journal articles
has connections (citations) that follow a
power-law distribution.
Timeline for Scale-Free Gene Co-Expression Networks
1965: Concept first conceived by Derek J. de Solla Price.
1999: Revived by Barabási and Albert, who showed its applicability to modeling the internet and biological networks.
2000: The concept of modeling gene expression data as a network was introduced by Butte and Kohane.
2002: Featherstone and Broadie showed that these networks exhibit scale-free topology.
Gene Co-Expression Network Analysis
(GCNA) = Systems Genetics Approach
• Goal is to understand the “system” instead of
reporting a list of individual parts
• Focus on gene clusters: “modules” rather than
individual genes
• Easily integrated with other types of data: genetic
marker and protein data, clinical traits
• Network structure translates to biological
pathways (can be confirmed and annotated using
gene ontology software)
GCNA addresses issues in microarray data
& complex disease genetics
• Individual gene expressions may be poorly
measured, so it is safer to study this data at the
module level.
• Modules are likely to represent pathways: genes which are co-regulated and/or interact.
• The signal from these pathways tends to be
stronger than the signal from a single gene.
• Alleviates multiple testing problem in traditional
association/differential expression analyses.
Network Terminology
Definitions:
Node = object (e.g., gene)
Connection = link between 2 nodes
k = Degree(Node_i) = # of links to Node_i
Pr(k) = probability that Node_i has k links
(A) Random network: each node has approximately the same number of links, for example 2. The degree follows a Poisson distribution:
$$\Pr(k) = \frac{e^{-\lambda}\,\lambda^{k}}{k!}$$
(B) Scale-free network: a few nodes are very highly connected. The degree follows a power law:
$$\Pr(k) \sim k^{-\gamma}$$
Barabási AL, Oltvai ZN (2004). Network biology: Understanding the cell's functional organization. Nature Reviews Genetics, 5, 101-113.
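To make the contrast between the random (Poisson) and scale-free (power-law) degree distributions concrete, here is a minimal Python sketch. The parameter values (mean degree 2, exponent 2.5, maximum degree 100) and the function names are illustrative assumptions, not values from the slides.

```python
import math

def poisson_pmf(k, lam):
    """Degree probability in a random network: exp(-lam) * lam**k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def power_law_pmf(k, gamma, k_max):
    """Degree probability proportional to k**(-gamma), normalized over k = 1..k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm

# Illustrative parameters (assumptions, not taken from the slides)
lam, gamma, k_max = 2.0, 2.5, 100

for k in (1, 2, 5, 10, 20):
    # The Poisson tail collapses quickly; the power-law tail decays slowly,
    # which is why a scale-free network contains a few highly connected hubs.
    print(k, poisson_pmf(k, lam), power_law_pmf(k, gamma, k_max))
```

At k = 20 the Poisson probability is many orders of magnitude smaller than the power-law probability, which matches the picture in panels (A) and (B).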
How to construct a gene co-expression network?
A) Microarray gene expression data
B) Use Pearson correlation to determine concordance of gene expressions $x_i$ and $x_j$ → $r(x_i, x_j)$
C) The Pearson correlation matrix is transformed via an adjacency function:
• Step function: $a_{ij} = I(|r(x_i, x_j)| > \tau)$ → Unweighted network
• Power function: $a_{ij} = |r(x_i, x_j)|^{\beta}$ → Weighted network
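Steps A to C can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the expression values are made up, the threshold 0.8 and power 6 are arbitrary choices, the diagonal is set to 0, and the helper names `pearson` and `adjacency` are hypothetical rather than part of the WGCNA software.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(expr, tau=None, beta=None):
    """Build an adjacency matrix from gene profiles (one list per gene).
    Pass tau for the step function, or beta for the power function."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-connections
            r = abs(pearson(expr[i], expr[j]))
            if tau is not None:
                adj[i][j] = 1.0 if r > tau else 0.0  # hard threshold
            else:
                adj[i][j] = r ** beta                # soft threshold
    return adj

# Toy data: 3 genes measured on 5 arrays (made-up numbers)
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 1.9, 3.2, 3.8, 5.1],
    [2.0, 1.0, 2.5, 1.5, 2.2],
]
unweighted = adjacency(expr, tau=0.8)
weighted = adjacency(expr, beta=6)
print(unweighted)
print(weighted)
```

In the weighted version the weakly correlated pair keeps a small but nonzero connection strength, whereas the step function discards it entirely.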
Two perspectives on scale-free networks: unweighted and weighted
Unweighted: $a_{ij} = I(|r(x_i, x_j)| > \tau)$
• Some genes are connected
• All connections are equal
Weighted: $a_{ij} = |r(x_i, x_j)|^{\beta}$
• All genes are connected
• Width of line = connection strength $a_{ij}$
Hard thresholding ignores connection strength information.
Gene ($x_i$) to gene ($x_j$) relationships in a network
• Adjacency matrix A = network, where each entry $a_{ij}$ gives the connection strength between $x_i$ and $x_j$
• Connectivity of gene $x_i$ = sum of $x_i$'s connection strengths (row sum of a gene):
$$k_i = \sum_{j} a_{ij}$$
• Topological overlap between $x_i$ and $x_j$ = measure of clustering or shared neighbors. Ravasz et al (2002)
$$\mathrm{TOM}_{ij} = \frac{\sum_{u} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$$
where $\sum_{u} a_{iu} a_{uj}$ is the number of genes connected to both $x_i$ and $x_j$. (Note: this TOM definition is for an unweighted network.)
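A small worked example may help with the connectivity and topological overlap definitions. The sketch below applies them to a made-up unweighted network of four genes; the function names and the convention of setting the diagonal of the overlap matrix to 1 are assumptions for illustration, not the WGCNA package's interface.

```python
def connectivity(adj):
    """k_i = row sum of the adjacency matrix (diagonal assumed to be 0)."""
    return [sum(row) for row in adj]

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = connectivity(adj)
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0  # assumed convention for the diagonal
                continue
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom

# Made-up unweighted network with edges 0-1, 0-2, 1-2, 2-3
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
k = connectivity(adj)
tom = topological_overlap(adj)
print(k)
for row in tom:
    print(row)
```

Genes 0 and 3 are not directly connected, yet they share gene 2 as a neighbor, so their overlap is 1 / (min(2, 1) + 1 - 0) = 0.5 rather than 0.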
Average Linkage Hierarchical Clustering
[Figure I: agglomerative partitioning. Figure II: average linkage.] (source: http://www.resample.com/xlminer/help/HClst/HClst_intro.htm)
• Agglomerative partitioning (Figure I) to define clusters. Start with n groups of 1 gene each and combine until there is 1 group of size n.
• Clusters are defined using “average linkage” (Figure II): at each step, the pair of clusters with the smallest average distance (1 - TOM) is combined.
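The agglomerative procedure can be sketched directly from the two bullet points. The following naive Python implementation is only an illustration: the dissimilarity values (standing in for 1 - TOM) are made up, and `average_linkage` is a hypothetical helper, not the clustering routine used by the WGCNA software.

```python
def average_linkage(dist):
    """Agglomerative clustering with average linkage.
    Returns the merge history as (cluster_a, cluster_b, average_distance)."""
    clusters = [[i] for i in range(len(dist))]  # start: one gene per group
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        merges.append((clusters[a], clusters[b], avg))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Made-up dissimilarities (1 - TOM) for four genes
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.2, 0.3, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
merges = average_linkage(dist)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

The merge heights are what a dendrogram draws on its vertical axis; genes 0, 1 and 2 join at low heights, while gene 3 joins last.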
Defining Network Modules
1. Hierarchical clustering of overlap measures results in a cluster tree (dendrogram)
2. Trim the tree at a level that gives a manageable number of genes and gene clusters (~1,000 genes, 3-10 clusters)
• Gene clusters are called modules
• Grey colors indicate genes outside of the modules
Network Module Analysis
• Identify relevant modules according to one or more of the following strategies:
• Associate module with trait, SNPs and/or connectivity data.
• Annotate module members and primary functions using gene ontology software.
Types of Network Connectivity
Recall: connectivity of a gene i: $k_i = \sum_{j} a_{ij}$
Whole network connectivity
is the sum of connection
strengths (aij) across all
network genes.
Intra-modular connectivity
is the sum of the connection
strengths of gene i within its
module.
Intra-modular connectivity is more biologically meaningful
than whole network connectivity.
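The difference between the two connectivity measures is easy to see on a toy weighted network. In this sketch the adjacency values and module labels are made up, and the function names are hypothetical; "grey" marks a gene outside any module, following the color convention mentioned earlier.

```python
def whole_network_connectivity(adj, i):
    """Sum of connection strengths of gene i across all network genes."""
    return sum(adj[i][j] for j in range(len(adj)) if j != i)

def intramodular_connectivity(adj, modules, i):
    """Sum of connection strengths of gene i to genes in its own module."""
    return sum(
        adj[i][j]
        for j in range(len(adj))
        if j != i and modules[j] == modules[i]
    )

# Made-up weighted adjacency matrix for four genes
adj = [
    [0.0, 0.8, 0.7, 0.1],
    [0.8, 0.0, 0.6, 0.2],
    [0.7, 0.6, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
]
modules = ["blue", "blue", "blue", "grey"]  # illustrative labels

for gene in range(len(adj)):
    print(
        gene,
        round(whole_network_connectivity(adj, gene), 3),
        round(intramodular_connectivity(adj, modules, gene), 3),
    )
```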
Applications of WGCNA Part I: inter-species comparison
1. Application to human and chimp brain tissue expression (2006)
• Modules that correspond to brain regions.
• Most and least conserved regions.
• Results agreed with known evolutionary hierarchy.
• Identified groups of genes that could be evolutionary drivers.
2. Application to two mouse strains (2007)
• Differential network analysis between BxH and BxD
• Identified pathways and genes related to weight.
Applications of WGCNA Part II: finding trait-related pathways and genes
1. Analysis of endothelial cell (EC) responses to oxidized lipids (2006)
• Identified 15 pathways characterizing response
• Identified potential gene targets for atherosclerosis
2. Integrated analysis of chronic fatigue syndrome data: microarray, SNP, traits (2008)
• Tutorial on integrated WGCNA, compared with standard microarray analysis
• Systems genetics screening criteria yield genes that are causal for the parent module
WGCNA Software: stand-alone and R package
Part II: Network Edge Orienting (NEO)
Jason Aten (1,2) and Steve Horvath (3)
(1) Biomathematics, (2) Human Genetics and (3) Biostatistics
Undirected Weighted Network → Directed Weighted Network
Motivation for Cause and Effect
Analysis in Genetics
• Large-scale genetic marker and gene expression
data sets can result in numerous genetic candidates
for follow-up studies.
• Many are due to chance rather than a true clinical
relationship.
• Cause and effect analysis can be performed on a
weighted gene co-expression network when genetic
marker data is available, based on the ‘Mendelian
randomization’ concept.
• Such an analysis may:
• Help prioritize among these gene candidates for follow up
analysis.
• Reduce spurious findings.
Historical Rationale for Causal Inference in Genetics (Katan 1986)
1. DNA variation as measured by genetic markers can only be causal or have no effect on gene expression and trait data; it is never reactive
2. Mendel’s law of independent assortment: genetic traits are inherited randomly → ‘Mendelian Randomization’
3. People with a particular DNA variation (X) that confers only a small physiological effect are otherwise comparable to people who have the normal allele (Y)
• The X subjects likely do not know of their particular genetic difference from the Y subjects, and lead comparable lives
• A study of this trait in X and Y adults would be equivalent to a prospective study that began with X and Y newborns and followed them through adulthood to see which developed the trait
How to infer causal relationships?
• Katan (1986) described how causal analysis in observational studies on the APOE gene (M) could determine whether there is a link between cholesterol (A) and cancer (B)
• Based on research findings:
• APOE alleles influenced cholesterol levels
• Observational studies found that low cholesterol was associated with cancer
• Three possible relationships:
1. M → A → B, |r(M,B)| > 0
2. M → A ← B, |r(M,B)| = 0
3. M → A ← Confounder → B, |r(M,B)| = 0
• Correlation information can distinguish relationship 1 from 2 and 3.
But, in practice true causality is
difficult to establish.
• r(M,B) = 0 is unlikely particularly in large data sets or if B is a
quantitative trait
• M → A: may be verified if SNP and gene expression correspond to the same gene
• Often not possible: it is expensive to have high coverage of
genes with both SNP markers and gene expression profiles.
• Confounded by other markers in linkage disequilibrium with study
marker(s)
• Relationships could be confounded by
• Gene or environment interactions
• Population stratification
• Causality inferred by genetic associations is best considered
probable causality
Network Edge Orienting Software (NEO)
• Developed by Jason Aten and Steve Horvath (2008) for
estimating edge orientations in a gene co-expression
network
• Methods based on structural equation modeling (SEM)
• First conceived by geneticist Sewall Wright (1921)
• Allows study of causal graphs in the context of statistical
distributions
• Each variable in a graph is modeled by combinations of 1 or
more other variables using linear regression
• NEO calculates Local Edge Orienting (LEO) scores
• Based on the relative probabilities of local structural equation
models – models including only 3 nodes
• Higher scores indicate stronger evidence for a causal
relationship
NEO software: Input
1. A set of quantitative variables (traits)
• Physiological traits
• Gene expression data
• Typically input both
2. SNP marker data (or other genetic marker data)
Unoriented Network Example
[Figure: unoriented network linking markers on Chr1, Chr2, …, ChrX to gene expressions E1, E2, E3, E4 and clinical traits HDL and Insulin.]
Key: markers sit on the chromosomes; E1, …, E4 = gene expressions; HDL, Insulin = clinical traits
1. Note that if the transcript corresponding to a SNP is known, the orientation of the edge is known
2. Edges between traits and gene expressions are not yet oriented
Network Edges Oriented
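[Figure: the same network with directed edges among markers on Chr1, Chr2, ..., Chr22, ChrX, gene expressions E1, E2, E3, E4 and traits HDL and Insulin; the edges shown carry the scores LEO=1.5, LEO=3.5, LEO=0.8 and LEO=0.5.]
Edges are directed. A score, which measures the strength of evidence for this direction, is assigned to each directed edge.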
NEO software: Output
1. Diagram of the directed network
2. Spreadsheet that summarizes LEO scores and provides hyperlinks to model fits (html files)
There are 5 models for a marker M and traits A and B.
In the table below “r” refers to correlation; the value “1” indicates |r| > 0, while “0” indicates r = 0.
Relationship      r(A, B)  r(M, A)  r(M, B)  r(M, A | B)  r(M, B | A)  r(A, B | M)
1. M → A → B         1        1        1          1            0            1
2. M → B → A         1        1        1          0            1            1
3. A ← M → B         1        1        1          1            1            0
4. M → A ← B*        1        1        0          1            1            1
5. M → B ← A*        1        0        1          1            1            1
*Note that models 4 and 5 are equivalent to the confounded model: M → X ← Confounder → Y.
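The pattern in the row for model 1 (M → A → B) can be checked with a small simulation. This is a minimal sketch under assumptions that are not in the slides: genotypes coded 0/1/2 with equal probability, unit effect sizes, Gaussian noise, and the standard first-order partial correlation formula for the conditional correlations.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """r(x, y | z): correlation of x and y after conditioning on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

random.seed(0)
n = 5000
# Simulate model 1: marker M drives trait A, which drives trait B
M = [random.choice([0, 1, 2]) for _ in range(n)]  # made-up genotype coding
A = [m + random.gauss(0, 1) for m in M]
B = [a + random.gauss(0, 1) for a in A]

print("r(M, B)     =", round(pearson(M, B), 3))          # nonzero
print("r(M, A | B) =", round(partial_corr(M, A, B), 3))  # nonzero
print("r(M, B | A) =", round(partial_corr(M, B, A), 3))  # near zero
```

Conditioning on A removes the association between M and B, which is the single 0 in the first row of the table; swapping the roles of A and B in the simulation would move that 0 to the r(M, A | B) column, as in model 2.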
Scores from NEO Software
1. Scores for model selection:
• Model p-values
• Local edge orienting score (LEO.NB.SingleMarker)
2. Traditional SEM measures for assessing model fit:
• Root Mean Square Error of Approximation (RMSEA)
• Comparative Fit Index (CFI)
• Standardized Root Mean Square Residual (SRMSR)
Model P-values
• H0: correlation = 0, H1: |correlation| > 0
• Correlations close to zero: H0 cannot be rejected; the data may fit the null distribution.
• Larger p-values = better model fit. A p-value > 0.05 is considered to indicate good fit.
• Steps for calculating a model p-value:
1. The correlation between a pair of nodes (r) is transformed to a Z-score using Fisher’s Z transformation:
$$Z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$
which, for n samples, is approximately normal with standard error $1/\sqrt{n - 3}$ under H0.
2. The corresponding p-value for this score can be obtained from a standard normal distribution table.
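The two steps can be sketched in Python as follows. The sample size n and the scaling by sqrt(n - 3) are the usual companions of Fisher's transformation and are assumptions here, since the slide does not state them; the function name is hypothetical, and this is not claimed to be the exact computation inside NEO.

```python
import math

def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: correlation = 0,
    given a sample correlation r computed from n samples."""
    # Step 1: Fisher's Z transformation, scaled by its standard error
    z = math.atanh(r) * math.sqrt(n - 3)
    # Step 2: two-sided tail probability of the standard normal
    return math.erfc(abs(z) / math.sqrt(2))

print(fisher_z_pvalue(0.05, 100))  # weak correlation: H0 is not rejected
print(fisher_z_pvalue(0.50, 100))  # strong correlation: H0 is rejected
```

Here `math.atanh(r)` equals 0.5 * ln((1 + r) / (1 - r)), and `math.erfc` replaces the lookup in a standard normal table.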
LEO Score = Relative Model Fit
• Compares the p-value of model A → B with the next best p-value.
• LEO score > 1 indicates a possible causal model.
• Implies that the p-value of the causal model is 10^1 = 10-fold higher than that of the best competing model.
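Read this way, the score is a base-10 log ratio of model p-values. A minimal sketch, assuming exactly that form (the function name and the example p-values are made up for illustration):

```python
import math

def leo_score(p_candidate, p_competitors):
    """log10 of the candidate model's p-value divided by the
    best (largest) p-value among the competing models."""
    return math.log10(p_candidate / max(p_competitors))

# Made-up p-values: candidate model A -> B versus four alternatives
score = leo_score(0.50, [0.05, 0.01, 0.002, 0.0004])
print(round(score, 3))
```

A candidate p-value ten times larger than the best competitor gives a score of 1; a candidate that fits worse than some competitor gives a negative score.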
SEM Measures for Assessing
Model Fit
• Compare observed Sm and expected Σ
covariance matrices.
• Σ consists of path coefficients among traits
and genetic markers.
• Recommended thresholds for assessing
likely causality:
• RMSEA ≤ 0.05
• CFI ≥ 0.90
• SRMSR ≤ 0.10
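The thresholds above translate into a simple screening check. In the sketch below, the RMSEA expression is the commonly used textbook form based on the model chi-square, its degrees of freedom and the sample size; the slides do not give a formula, so treat it and the function names as assumptions.

```python
import math

def rmsea(chi_square, df, n_samples):
    """Common textbook form: sqrt(max(chi2 - df, 0) / (df * (N - 1)))."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n_samples - 1)))

def passes_fit_thresholds(rmsea_value, cfi, srmsr):
    """Apply the thresholds listed above: RMSEA <= 0.05, CFI >= 0.90, SRMSR <= 0.10."""
    return rmsea_value <= 0.05 and cfi >= 0.90 and srmsr <= 0.10

# Made-up fit statistics for one candidate model
value = rmsea(chi_square=1.2, df=1, n_samples=200)
print(round(value, 4), passes_fit_thresholds(value, cfi=0.97, srmsr=0.03))
```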
Multi-marker Models
• NEO analysis has been generalized to multiple markers.
• Two LEO scores per model rather than one.
• Common pleiotropic anchor (CPA) > 0.8
• Orthogonal candidate anchor (OCA) > 0.3
4 Multi-Marker Models
Multi-Marker NEO can Perform
Marker Selection
• Methods
• Selecting markers with the
best correlation
• Forward-stepwise multivariate
regression approach
• Combination
• The OCA and CPA scores are
computed at each SNP
selection step and should be
robust to the number of SNPs
selected
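The first of the listed methods, selecting the markers with the best correlation, can be sketched as a simple ranking. This illustrates only that one method, not the forward-stepwise regression and not NEO's own implementation; the SNP names, genotypes and trait values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_markers(snps, trait, n_select):
    """Rank markers by |r(marker, trait)| and keep the n_select strongest."""
    ranked = sorted(
        snps,
        key=lambda name: abs(pearson(snps[name], trait)),
        reverse=True,
    )
    return ranked[:n_select]

# Made-up genotypes (0/1/2) for three SNPs in eight samples
snps = {
    "snp1": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp2": [1, 1, 0, 2, 0, 1, 2, 0],
    "snp3": [2, 0, 1, 1, 2, 0, 0, 1],
}
trait = [0.1, 1.2, 2.1, 0.9, 0.2, 1.8, 1.1, -0.1]

selected = top_markers(snps, trait, n_select=2)
print(selected)
```

Ranking by absolute correlation treats negatively and positively associated markers alike, which is why the absolute value is taken before sorting.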
Multi-Marker Simulation Test
• Simulated a causal network consisting of the following nodes:
• 5 gene expressions (E1-E5)
• Each gene expression controlled by 3 SNPs (18 correct SNPs)
• 82 Noise SNPs
• Trait
• Confounder
• Simulated edges:
E1 → E2
E1 → E3
E3 ← HiddenConfounder → E4
E4 → Trait
Trait → E5
• Can NEO retrieve the correct SNPs and the correct edge orientations?
Simulation Results
• A red or orange
square in position
(i,j) indicates that a
trait in row i
causally affects the
corresponding trait
from column j.
• NEO successfully
reproduced the
simulated
orientations.
• All 18 SNPs were
identified.
NEO and WGCNA Software
Available Online
• R software, tutorials, and simulated and
real data sets for NEO and WGCNA can
be found online:
• www.genetics.ucla.edu/labs/horvath/aten/NEO/
• http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
• Google search
• weighted co-expression network
• “WGCNA”
• “co-expression network”
Summary: WGCNA & NEO
• WGCNA is a systems genetics approach that is useful for
complex disease analysis
• Genetic signal is weak for individual genes, problematic for
traditional DNA-level analyses
• RNA level data analysis may identify clusters of genes
corresponding to trait-related pathways
• Helps alleviate multiple testing problem
• Focusing on clusters of genes rather than individual genes
improves information quality from microarray data
• WGCNA is also useful for inter-species comparison of gene
expression levels
• NEO can estimate edge orientation in a weighted gene co-expression network if relevant genetic marker data are available
• NEO can also perform marker selection
Key References:
Acknowledgements
• WGCNA developed by Bin Zhang and Steve Horvath
• NEO developed by Jason Aten and Steve Horvath
• Lab members: Peter Langfelder, Jun Dong, Tova
Fuller, Ai Li, Wen Lin, Wei Zhao
• Collaborators: Jake Lusis, Tom Drake, Anatole
Ghazalpour