
Weighted gene co-expression network analysis (WGCNA) and network edge orienting (NEO)
Part I: WGCNA
Part II: NEO
Bin Zhang and Steve Horvath
Departments of Human Genetics and Biostatistics
University of California, Los Angeles, USA

Challenges of Modern Genetics
1. Genetic analysis of complex diseases is difficult
   • Requires searching for many small-effect genes
   • Difficult to detect signal at the DNA level
   • RNA-level data may identify clusters of genes
2. Microarray technology measures RNA levels (gene expression)
   • But this data is noisy!
   • Focusing on single genes can lead to spurious results due to outliers or array artifacts
→ Network analysis of RNA data: "Gene Co-expression Network Analysis" (GCNA)

Scale-free Networks: Derek J de Solla Price
• Derek J de Solla Price was a professor of applied mathematics at Raffles College, which became part of the University of Singapore in 1948.
• Singapore = great location for a systems biology conference!
• In 1965 he published the first example of a scale-free network.
• The network of scientific journal articles has connections (citations) that follow a power-law distribution.

Timeline for Scale-Free Gene Co-Expression Networks
1965: Concept first conceived by Derek J. de Solla Price.
1999: Revived by Barabási and Albert, who found it applicable to modeling the internet and biological networks.
2000: The concept of modeling gene expression data as a network was introduced by Butte and Kohane.
2002: Featherstone and Broadie showed that these networks exhibit scale-free topology.

Gene Co-Expression Network Analysis (GCNA) = Systems Genetics Approach
• Goal is to understand the "system" instead of reporting a list of individual parts
• Focus on gene clusters ("modules") rather than individual genes
• Easily integrated with other types of data: genetic marker and protein data, clinical traits
• Network structure translates to biological pathways (can be confirmed and annotated using gene ontology software)

GCNA addresses issues in microarray data & complex disease genetics
• Individual gene expressions may be poorly measured, so it is safer to study this data at the module level.
• Modules are likely to represent pathways: genes that are co-regulated and/or interact.
• The signal from these pathways tends to be stronger than the signal from a single gene.
• Alleviates the multiple testing problem in traditional association/differential expression analyses.

Network Terminology
Definitions:
• Node = object (e.g., a gene)
• Connection = link between 2 nodes
• k = Degree(Node_i) = number of links to Node_i
• Pr(k) = probability that Node_i has k links
(A) Random network: each node has approximately the same number of links, for example 2. $\Pr(k) = e^{-\lambda}\lambda^{k}/k!$
(B) Scale-free network: a few nodes are very highly connected. $\Pr(k) \sim k^{-\gamma}$
Barabási AL, Oltvai ZN (2004). Network biology: Understanding the cell's functional organization. Nature Reviews Genetics, 5, 101-113.

How to construct a gene co-expression network?
A) Start from microarray gene expression data.
B) Use the Pearson correlation to determine the concordance of gene expressions $x_i$ and $x_j$ → $r(x_i, x_j)$.
C) Transform the Pearson correlation matrix via an adjacency function:
   • Step function: $a_{ij} = I(|r(x_i, x_j)| > \tau)$ → unweighted network
   • Power function: $a_{ij} = |r(x_i, x_j)|^{\beta}$ → weighted network

Two perspectives on scale-free networks: unweighted and weighted
Unweighted: $a_{ij} = I(|r(x_i, x_j)| > \tau)$
• Some genes are connected
• All connections are equal
Weighted: $a_{ij} = |r(x_i, x_j)|^{\beta}$
• All genes are connected
• Width of line = connection strength
Hard thresholding ignores connection strength information.

Gene ($x_i$) to gene ($x_j$) relationships in a network
• Adjacency matrix A = network, where each entry $a_{ij}$ gives the connection strength between $x_i$ and $x_j$.
• Connectivity of gene $x_i$ = sum of $x_i$'s connection strengths = row sum of the gene: $k_i = \sum_j a_{ij}$
• Topological overlap between $x_i$ and $x_j$ = measure of clustering or shared neighbors (Ravasz et al. 2002):
  $\mathrm{TOM}_{ij} = \dfrac{\sum_u a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$
  where $\sum_u a_{iu} a_{uj}$ is the number of genes u connected to both $x_i$ and $x_j$.
  (Note: this TOM definition is for an unweighted network.)

Average Linkage Hierarchical Clustering
(Figures I and II; source: http://www.resample.com/xlminer/help/HClst/HClst_intro.htm)
• Agglomerative partitioning (Figure I) to define clusters. Start with n groups of 1 gene each and merge until a single group of size n remains.
• Clusters are defined using "average linkage" (Figure II): the pair of clusters with the smallest average distance (1 - TOM) is combined.

Defining Network Modules
1. Hierarchical clustering of overlap measures results in a cluster tree (dendrogram).
2. Trim the tree at a level that gives a manageable number of genes and gene clusters (~1,000 genes, 3-10 clusters).
   • Gene clusters are called modules
   • Grey colors indicate genes outside of the modules

Network Module Analysis
• Identify relevant modules according to one or more of the following strategies:
   • Associate module with trait, SNPs and/or connectivity data.
   • Annotate module members and primary functions using gene ontology software.

Types of Network Connectivity
Recall the connectivity of a gene i: $k_i = \sum_j a_{ij}$
• Whole network connectivity is the sum of connection strengths ($a_{ij}$) across all network genes.
• Intra-modular connectivity is the sum of the connection strengths of gene i within its module.
• Intra-modular connectivity is more biologically meaningful than whole network connectivity.

Applications of WGCNA Part I: inter-species comparison
1. Application to human and chimp brain tissue expression (2006)
   • Modules that correspond to brain regions.
   • Most and least conserved regions.
   • Results agreed with known evolutionary hierarchy.
   • Identified groups of genes that could be evolutionary drivers.
2. Application to two mouse strains (2007)
   • Differential network analysis between BxH and BxD
   • Identified pathways and genes related to weight.

Applications of WGCNA Part II: finding trait-related pathways and genes
1. Analysis of endothelial cell (EC) responses to oxidized lipids (2006)
   • Identified 15 pathways characterizing the response
   • Identified potential gene targets for atherosclerosis
2. Integrated analysis of chronic fatigue syndrome data: microarray, SNP, traits (2008)
   • Tutorial on integrated WGCNA, compared with standard microarray analysis
   • Systems genetics screening criteria yield genes that are causal for the parent module

WGCNA Software: stand alone and R package

Part II: Network Edge Orienting (NEO)
Jason Aten (1,2) and Steve Horvath (3)
(1) Biomathematics, (2) Human Genetics and (3) Biostatistics
Undirected Weighted Network → Directed Weighted Network

Motivation for Cause and Effect Analysis in Genetics
• Large-scale genetic marker and gene expression data sets can result in numerous genetic candidates for follow-up studies.
• Many are due to chance rather than a true clinical relationship.
• Cause and effect analysis can be performed on a weighted gene co-expression network when genetic marker data is available, based on the "Mendelian randomization" concept.
• Such an analysis may:
   • Help prioritize among these gene candidates for follow-up analysis.
   • Reduce spurious findings.

Historical Rationale for Causal Inference in Genetics (Katan 1986)
1. DNA variation as measured by genetic markers can only be causal for, or have no effect on, gene expression and trait data; it is never reactive.
2. Mendel's law of independent assortment: genetic traits are inherited randomly → "Mendelian Randomization"
3. People with a particular DNA variation (X) that confers only a small physiological effect are otherwise comparable to people who have the normal allele (Y).
   • The X subjects likely do not know of their particular genetic difference from the Y subjects, and lead comparable lives.
   • A study of this trait in X and Y adults would be equivalent to a prospective study that began with X and Y newborns and followed them through adulthood to see which developed the trait.

How to infer causal relationships?
• Katan (1986) described how causal analysis in observational studies on the APOE gene (M) could determine whether there is a link between cholesterol (A) and cancer (B).
• Based on research findings:
   • APOE alleles influenced cholesterol levels
   • Observational studies found that low cholesterol was associated with cancer
• Three possible relationships:
   1. M → A → B: |r(M,B)| > 0
   2. M → A ← Confounder → B: |r(M,B)| = 0
   3. M → A ← B: |r(M,B)| = 0 (equivalent to 2)
• Correlation information can distinguish relationship 1 from 2 and 3.

But in practice true causality is difficult to establish.
• r(M,B) = 0 is unlikely, particularly in large data sets or if B is a quantitative trait.
• M → A may be verified if the SNP and the gene expression correspond to the same gene.
   • Often not possible: it is expensive to have high coverage of genes with both SNP markers and gene expression profiles.
   • Confounded by other markers in linkage disequilibrium with the study marker(s)
• Relationships could be confounded by
   • Gene or environment interactions
   • Population stratification
• Causality inferred from genetic associations is best considered probable causality.

Network Edge Orienting Software (NEO)
• Developed by Jason Aten and Steve Horvath (2008) for estimating edge orientations in a gene co-expression network
• Methods based on structural equation modeling (SEM)
   • First conceived by geneticist Sewall Wright (1921)
   • Allows study of causal graphs in the context of statistical distributions
   • Each variable in a graph is modeled by combinations of 1 or more other variables using linear regression
• NEO calculates Local Edge Orienting (LEO) scores
   • Based on the relative probabilities of local structural equation models, i.e. models including only 3 nodes
   • Higher scores indicate stronger evidence for a causal relationship

NEO software: Input
1. A set of quantitative variables (traits)
   • Physiological traits
   • Gene expression data
   • Typically input both
2. SNP marker data (or other genetic marker data)

Unoriented Network Example
(Figure: markers on Chr1, Chr2, …, ChrX; gene expressions E1, …, E4; clinical traits HDL and Insulin.)
1. Note that if the transcript corresponding to a SNP is known, the orientation of the edge is known.
2. Edges between traits and gene expressions are not yet oriented.

Network Edges Oriented
(Figure: the same network over Chr1, Chr2, ..., Chr22, ChrX with directed edges among E1-E4, HDL and Insulin, labeled LEO=1.5, LEO=3.5, LEO=0.8 and LEO=0.5.)
Edges are directed. A score, which measures the strength of evidence for this direction, is assigned to each directed edge.

NEO software: Output
1. Diagram of the directed network
2. Spreadsheet that summarizes LEO scores and provides hyperlinks to model fits (html files)

There are 5 models for a marker M and traits A and B
In the table below, "r" refers to correlation; the value 1 indicates |r| > 0, while 0 indicates r = 0.

Relationship       r(A,B)  r(M,A)  r(M,B)  r(M,A|B)  r(M,B|A)  r(A,B|M)
1. M → A → B         1       1       1        1         0         1
2. M → B → A         1       1       1        0         1         1
3. A ← M → B         1       1       1        1         1         0
4. M → A ← B*        1       1       0        1         1         1
5. M → B ← A*        1       0       1        1         1         1

*Note that models 4 and 5 are equivalent to the confounded model: M → X ← Confounder → Y.

Scores from NEO Software
1. Scores for model selection:
   • Model p-values
   • Local edge orienting score (LEO.NB.SingleMarker)
2. Traditional SEM measures for assessing model fit:
   • Root Mean Square Error of Approximation (RMSEA)
   • Comparative Fit Index (CFI)
   • Standardized Root Mean Square Residual (SRMSR)

Model P-values
• H0: correlation = 0, H1: |correlation| > 0
• Correlations close to zero: H0 cannot be rejected; the data may fit the null distribution.
• Larger p-values = better model fit. A p-value > 0.05 is considered to indicate good fit.
• Steps for calculating a model p-value:
   1. The correlation between a pair of nodes (r) is transformed to a Z-score using Fisher's Z transformation, $z = \tfrac{1}{2}\ln\dfrac{1+r}{1-r}$, which has approximate standard error $1/\sqrt{n-3}$ for n samples.
   2. The corresponding p-value for this score can be obtained from a standard normal distribution table.

LEO Score = Relative Model Fit
• Compares the p-value of model A → B with the next best p-value.
• LEO score > 1 indicates a possible causal model.
• This implies that the p-value of the causal model is $10^1 = 10$-fold higher than that of the best competing model.

SEM Measures for Assessing Model Fit
• Compare the observed ($S_m$) and expected (Σ) covariance matrices.
• Σ consists of path coefficients among traits and genetic markers.
• Recommended thresholds for assessing likely causality:
   • RMSEA ≤ 0.05
   • CFI ≥ 0.90
   • SRMSR ≤ 0.10

Multi-marker Models
• NEO analysis has been generalized to multiple markers.
• Two LEO scores per model rather than one:
   • Common pleiotropic anchor (CPA) > 0.8
   • Orthogonal candidate anchor (OCA) > 0.3

4 Multi-Marker Models
(Figure only.)

Multi-Marker NEO can Perform Marker Selection
• Methods
   • Selecting markers with the best correlation
   • Forward-stepwise multivariate regression approach
   • Combination
• The OCA and CPA scores are computed at each SNP selection step and should be robust to the number of SNPs selected.

Multi-Marker Simulation Test
• Simulated a causal network consisting of the following nodes:
   • 5 gene expressions (E1-E5)
   • Each gene expression controlled by 3 SNPs (18 correct SNPs)
   • 82 noise SNPs
   • Trait
   • Confounder
• Simulated edges: E1 → E2, E1 → E3, E3 ← HiddenConfounder → E4, E4 → Trait, Trait → E5
• Can NEO retrieve the correct SNPs and the correct edge orientations?

Simulation Results
• A red or orange square in position (i,j) indicates that the trait in row i causally affects the corresponding trait in column j.
• NEO successfully reproduced the simulated orientations.
• All 18 SNPs were identified.
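
Toy single-marker illustration
The single-marker reasoning in the five-model table can be illustrated with a small simulation. The sketch below is not the NEO software and does not fit structural equation models. It simulates model 1 (M → A → B) with made-up effect sizes, then checks the correlation pattern from the table using Pearson and first-order partial correlations, Fisher's Z for p-values, and a log10 p-value ratio in the spirit of the LEO score. All function names, the sample size, the noise levels and the example p-values are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation r(x, y | z)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def fisher_p(r, n, n_cond=0):
    """Two-sided p-value for H0: correlation = 0 using Fisher's Z.

    n_cond is the number of conditioning variables (0 for a plain correlation).
    """
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3 - n_cond)
    return math.erfc(abs(z) / math.sqrt(2))


def leo_like_score(p_candidate, p_competing):
    """log10 ratio of a candidate model p-value to the best competing one."""
    return math.log10(p_candidate / max(p_competing))


# Simulate model 1 from the table: marker M -> trait A -> trait B.
random.seed(0)
n = 2000
M = [random.choice([0, 1, 2]) for _ in range(n)]  # hypothetical SNP genotype
A = [m + random.gauss(0, 1) for m in M]           # A depends on M
B = [a + random.gauss(0, 1) for a in A]           # B depends on A only

r_MB = pearson(M, B)
r_MB_given_A = partial_corr(M, B, A)  # table entry 0 for model 1
r_MA_given_B = partial_corr(M, A, B)  # table entry 1 for model 1

print("r(M,B)     =", round(r_MB, 3), "p =", fisher_p(r_MB, n))
print("r(M,B | A) =", round(r_MB_given_A, 3),
      "p =", fisher_p(r_MB_given_A, n, n_cond=1))
print("r(M,A | B) =", round(r_MA_given_B, 3))

# Hypothetical model p-values: candidate 0.5, competitors 0.05 and 0.01.
print("LEO-like score:", leo_like_score(0.5, [0.05, 0.01]))
```

Conditioning on A removes the association between M and B, which is what separates model 1 from model 3 (A ← M → B) in the table. With real data the relevant p-values would come from fitted models rather than from a single correlation test.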

NEO and WGCNA Software Available Online
• R software, tutorials, and simulated and real data sets for NEO and WGCNA can be found online:
   • www.genetics.ucla.edu/labs/horvath/aten/NEO/
   • http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
• Google search
   • weighted co-expression network
   • "WGCNA"
   • "co-expression network"

Summary: WGCNA & NEO
• WGCNA is a systems genetics approach that is useful for complex disease analysis
   • Genetic signal is weak for individual genes, which is problematic for traditional DNA-level analyses
   • RNA-level data analysis may identify clusters of genes corresponding to trait-related pathways
   • Helps alleviate the multiple testing problem
   • Focusing on clusters of genes rather than individual genes improves information quality from microarray data
• WGCNA is also useful for inter-species comparison of gene expression levels
• NEO can estimate edge orientation in a weighted gene co-expression network if relevant genetic marker data is available
• NEO can also perform marker selection

Key References

Acknowledgements
• WGCNA developed by Bin Zhang and Steve Horvath
• NEO developed by Jason Aten and Steve Horvath
• Lab members: Peter Langfelder, Jun Dong, Tova Fuller, Ai Li, Wen Lin, Wei Zhao
• Collaborators: Jake Lusis, Tom Drake, Anatole Ghazalpour
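
Toy network construction sketch
As a companion to the Part I slides, the construction steps (Pearson correlation, power adjacency, connectivity, topological overlap, average-linkage clustering on 1 - TOM) can be sketched on simulated data. This is a minimal plain-Python sketch, not the WGCNA R package. The function names, the choice β = 6, the two simulated modules and the decision to stop at two clusters are all assumptions made for illustration. The TOM formula from the slides is applied here to weighted adjacencies in [0, 1].

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def power_adjacency(expr, beta):
    """Weighted adjacency a_ij = |r(x_i, x_j)|^beta, with a_ii set to 0."""
    g = len(expr)
    return [[0.0 if i == j else abs(pearson(expr[i], expr[j])) ** beta
             for j in range(g)] for i in range(g)]


def connectivity(adj):
    """Connectivity k_i = sum_j a_ij (row sums of the adjacency matrix)."""
    return [sum(row) for row in adj]


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = connectivity(adj)
    tom = [[1.0] * g for _ in range(g)]  # diagonal stays at 1
    for i in range(g):
        for j in range(g):
            if i != j:
                shared = sum(adj[i][u] * adj[u][j] for u in range(g))
                tom[i][j] = (shared + adj[i][j]) / (
                    min(k[i], k[j]) + 1 - adj[i][j])
    return tom


def average_linkage(dist, n_clusters):
    """Merge the pair of clusters with the smallest average distance
    until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j]
                            for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


# Simulated data: genes 0-2 follow one latent factor, genes 3-5 another.
random.seed(1)
n_samples = 50
f1 = [random.gauss(0, 1) for _ in range(n_samples)]
f2 = [random.gauss(0, 1) for _ in range(n_samples)]
expr = ([[f + random.gauss(0, 0.3) for f in f1] for _ in range(3)]
        + [[f + random.gauss(0, 0.3) for f in f2] for _ in range(3)])

adj = power_adjacency(expr, beta=6)          # beta = 6 is an assumed value
tom = topological_overlap(adj)
dist = [[1 - t for t in row] for row in tom]  # dissimilarity 1 - TOM
modules = sorted(sorted(m) for m in average_linkage(dist, n_clusters=2))
print("connectivity:", [round(k, 2) for k in connectivity(adj)])
print("modules:", modules)
```

Stopping at a fixed number of clusters stands in for trimming the dendrogram at a chosen height. For realistic numbers of genes, a matrix library and an optimized clustering routine would replace these nested loops.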
