GeneQuery:
A phenotype search tool
based on gene co-expression clustering
Alexander Predeus
21-oct-2015
About myself
• Graduated from Moscow State University (1998-2003)
• PhD: Michigan State University (2003-2009): organometallic chemistry
• Post-doc #1, MSU: quantitative biology (molecular dynamics)
• Post-doc #2, Wash U: next-generation sequencing, systems biology
Outline
• What is gene clustering?
• WGCNA: Mjölnir of clustering
• What is gene set enrichment analysis?
• How can we find experiments biologically similar to ours?
• GEO database universe
• Cmap perturbagen database and how it’s useful to us
Clustering
• The goal of any clustering is to group objects by similarity
• Thus clustering reveals the inner structure of the data
• Two questions arise:
– how do you measure similarity?
– how do you use that measure to find the groups?
Gene clustering
• For gene expression you can cluster individual samples (columns)
• Samples from the same conditions are expected to cluster together (unless there is a batch effect!)
• You can also cluster genes (rows)
• Genes that are regulated in the same pathway tend to be co-expressed
Ways to measure distance & cluster
• Metrics can vary depending on the goal
• Euclidean distance or correlation are commonly used
• Euclidean distance is sensitive to scaling and average expression, correlation is not
[Figure panels: a. groups; b. hierarchical; c. k-means; d. SOM]
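To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The three toy "genes" and the helper names are illustrative assumptions, not data or code from the talk. Gene B is a scaled and shifted copy of gene A, so correlation distance treats them as identical while Euclidean distance puts them far apart.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """1 - r: 0 for perfectly correlated, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)

# Toy profiles across four samples (hypothetical values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [12.0, 14.0, 16.0, 18.0]  # same shape as gene_a, scaled and shifted
gene_c = [4.0, 3.0, 2.0, 1.0]      # same scale as gene_a, opposite trend

d_euc_ab = euclidean(gene_a, gene_b)
d_euc_ac = euclidean(gene_a, gene_c)
d_cor_ab = correlation_distance(gene_a, gene_b)
d_cor_ac = correlation_distance(gene_a, gene_c)
```

With Euclidean distance, gene A ends up closer to the anti-correlated gene C than to the co-regulated gene B; with correlation distance the ordering is reversed.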
Real-life datasets
• … are hard to cluster, because they are messy
• Outlier samples and genes are most often the problem
• Cluster shape is also important
Mjölnir
• Mjölnir is Thor’s hammer, which cannot be lifted by anyone who is not a god
An astronomer’s perspective
• It’s not necessarily magical, it could just be very heavy
WGCNA algorithm 1
• Stands for Weighted Gene Co-expression Network Analysis (also called weighted correlation network analysis)
• Uses a Pearson correlation-based metric
• As the name suggests, it is tightly related to the network analysis paradigm, more concretely to the concept of a scale-free network
• If we have m samples, $x_i$ is the vector of expression values of gene i (length m), and we are comparing genes i and j:
– similarity: $s_{ij} = |\mathrm{cor}(x_i, x_j)|$
– unweighted adjacency: $a_{ij} = 1$ if $s_{ij} \ge \tau$, and $a_{ij} = 0$ otherwise
• If τ (the cutoff correlation) is set to 0.8, no gene pairs with a correlation of 0.799 or below would be considered adjacent
WGCNA algorithm 2
• Microarrays are noisy!
• Weighted adjacency: $a_{ij} = s_{ij}^{\beta}$
• β would be > 1, usually 6-20 in real applications
• This approach is called soft thresholding
• Gene significance can range from 0 to 1 and is defined via a clinical trait, which can be quantitative (body weight) or qualitative (treatment vs. control)
• By calculating all $a_{ij}$ we have constructed an n x n adjacency matrix, where n is the number of genes
Soft thresholding illustrated
• As the power β increases, the adjacency of lowly correlated genes becomes negligible
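A small sketch of the contrast between a hard cutoff and soft thresholding, assuming the similarity is the absolute Pearson correlation. The function names and the default values τ = 0.8 and β = 6 are chosen here for illustration only.

```python
def hard_adjacency(similarity, tau=0.8):
    """Unweighted adjacency: 1 if the similarity reaches the cutoff, else 0."""
    return 1.0 if similarity >= tau else 0.0

def soft_adjacency(similarity, beta=6):
    """Weighted adjacency: the similarity raised to the power beta."""
    return similarity ** beta

# Two almost identical correlations on either side of the cutoff,
# plus one weak and one strong correlation (illustrative values).
for s in (0.3, 0.799, 0.8, 0.95):
    print(f"s={s:<6} hard={hard_adjacency(s):.0f}  soft={soft_adjacency(s):.4f}")
```

The hard cutoff separates 0.799 from 0.8 completely, while the soft version gives them nearly the same weight and pushes the weak correlation of 0.3 close to zero.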
What does it have to do with a network?
• If genes are adjacent, that can be represented as a connection
• If not, then there is no connection
[Figure: adjacency matrix and the corresponding network]
Why bring the network into this at all?
• A scale-free network is defined as one for which the probability of a node having k connections decays as a power law: $p(k) \sim k^{-\gamma}$
• Scale-free topology is a philosophical phenomenon
WGCNA algorithm 3
• So we want our network to be scale-free; how do we achieve it?
• First, we calculate connectivities: $k_i = \sum_{j \ne i} a_{ij}$
• Then we simply vary β in the range from 1 to 20, calculate the connectivity distribution p(k) over all genes for each β, and see how linear the log(p(k)) vs. log(k) plot is (as measured by R-squared)
• We want the fit to be very close to linear, because a scale-free network has $p(k) \sim k^{-\gamma}$, i.e. $\log p(k) = -\gamma \log k + \mathrm{const}$
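The β scan can be sketched as follows. This is a simplified illustration in plain Python, not the WGCNA package code: the equal-width binning into 10 bins, the helper names, and the rule "first β with negative slope and R-squared above the cutoff" are assumptions made for the example.

```python
import math

def connectivity(adj):
    """k_i = sum of a_ij over all j != i."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, R-squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def scale_free_fit(k, n_bins=10):
    """Bin connectivities and fit log10 p(k) against log10 k."""
    lo, hi = min(k), max(k)
    if hi == lo:
        return 0.0, 0.0
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for v in k:
        b = min(int((v - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += v
    xs, ys = [], []
    for c, s in zip(counts, sums):
        if c > 0 and s > 0:  # skip empty bins, log is undefined there
            xs.append(math.log10(s / c))       # mean connectivity in the bin
            ys.append(math.log10(c / len(k)))  # fraction of genes in the bin
    if len(xs) < 3:
        return 0.0, 0.0
    return linear_fit(xs, ys)

def pick_beta(sim, betas=range(1, 21), r2_cutoff=0.8):
    """Return the first beta whose network looks scale-free, or None."""
    for beta in betas:
        adj = [[s ** beta for s in row] for row in sim]
        slope, r2 = scale_free_fit(connectivity(adj))
        if slope < 0 and r2 >= r2_cutoff:
            return beta
    return None

# Toy connectivity list whose counts fall off roughly as k^-2.
toy_k = [1] * 100 + [2] * 25 + [4] * 6 + [8] * 2
toy_slope, toy_r2 = scale_free_fit(toy_k)
```

Here `sim` is expected to be the n x n matrix of similarities $s_{ij}$; on real data the fit is usually inspected as a plot of R-squared against β rather than accepted automatically.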
WGCNA algorithm 4
• So, we chose the β that gives us an R-squared of at least 0.8, i.e. constructed a network. What now?
• We identify modules using TOM, the topological overlap measure
• When two genes (nodes) connect to the same large group of other nodes, they have high topological overlap
WGCNA algorithm 5
• TOM values can thus be represented as a matrix with the same dimensionality as the adjacency matrix: n x n
• The TOM-based dissimilarity measure will thus be $d_{ij} = 1 - \mathrm{TOM}_{ij}$
• Using this dissimilarity measure as a metric, we perform hierarchical clustering
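For reference, a minimal sketch of the TOM-based dissimilarity. The slide's own formula did not survive in this transcript, so the code assumes the commonly used definition $\mathrm{TOM}_{ij} = (l_{ij} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$ with $l_{ij} = \sum_{u \ne i,j} a_{iu} a_{uj}$; the function name and the toy network are illustrative.

```python
def tom_dissimilarity(adj):
    """Return the n x n matrix d_ij = 1 - TOM_ij for a symmetric adjacency matrix."""
    n = len(adj)
    # Connectivity of every node, excluding the diagonal.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    diss = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a node has zero dissimilarity to itself
            # Connection strength shared through all third-party nodes u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            diss[i][j] = 1.0 - tom
    return diss

# Toy unweighted network: nodes 0 and 1 are linked to each other and to 2 and 3;
# nodes 2 and 3 are not linked directly but share both of their neighbours.
toy_adj = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]
toy_diss = tom_dissimilarity(toy_adj)
```

Nodes 2 and 3 come out with a dissimilarity of 1/3 even though they are not adjacent, which is the point of the measure. The resulting matrix could then be passed to any hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage` after conversion to condensed form with `scipy.spatial.distance.squareform`.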
Final touch: Dynamic and hybrid dynamic tree cut
• Notice the presence of a “null module”, which de facto holds the genes rejected from all modules
Eigengenes
• An eigengene is the first principal component of a module
• Eigengene expression can be used as a measure of module expression change across the samples
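As an illustration, the first principal component of a module can be obtained without any external library by power iteration. This is a sketch under stated assumptions: each gene is standardized across samples first, no gene is constant, and the sign is chosen so that the eigengene follows the average profile of the module. The function names and the toy module are hypothetical.

```python
import math

def standardize(row):
    """Center a gene's profile and scale it to unit sample variance."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def module_eigengene(expr, n_iter=100):
    """First principal component (one value per sample) of a genes x samples module."""
    z = [standardize(g) for g in expr]
    m = len(z[0])
    v = z[0][:]  # starting vector: the first gene's standardized profile
    for _ in range(n_iter):
        # One power-iteration step: v <- Z^T (Z v), then renormalize.
        proj = [sum(g[s] * v[s] for s in range(m)) for g in z]
        w = [sum(p * g[s] for p, g in zip(proj, z)) for s in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Orient the eigengene along the module's average standardized profile.
    mean_profile = [sum(g[s] for g in z) / len(z) for s in range(m)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v

# Toy module: three genes that rise together across four samples.
toy_module = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [0.0, 1.0, 2.0, 3.0],
]
eigengene = module_eigengene(toy_module)
```

The returned vector has unit length and one entry per sample, so it can be compared across samples like the expression of a single representative gene.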
The wonderful story of ciclopirox
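Hypergeometric probability and gene sets
• Protein-coding gene repertoire of ~20k genes
• Signaling pathway X containing 100 genes
• DE: 200 genes are up-regulated
• How many genes from pathway X would be included in our 200 simply by chance?
• What is the p-value of having 50 or more?
Use the hypergeometric distribution
• Mathematically identical to the drawing-without-replacement model
• The exact solution is known as Fisher’s exact test: $P(X \ge k) = \sum_{i=k}^{\min(K,n)} \binom{K}{i} \binom{N-K}{n-i} / \binom{N}{n}$, with N genes in total, K genes in pathway X, n up-regulated genes, and k of them in the pathway
• We want the right-sided p-value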
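The numbers from this slide can be plugged into the right-sided hypergeometric tail directly. The sketch below uses exact integer arithmetic from the standard library; the function name is an illustrative choice, and N = 20,000 stands in for the "~20k" genes quoted above.

```python
from math import comb

def hypergeom_right_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

N, K, n = 20_000, 100, 200

# Expected overlap by chance: n * K / N
expected_overlap = n * K / N

# Right-sided p-values for "at least 1" and "at least 50" pathway genes.
p_at_least_1 = hypergeom_right_tail(N, K, n, 1)
p_at_least_50 = hypergeom_right_tail(N, K, n, 50)
```

On average only one pathway gene is expected among the 200 by chance, so an overlap of 50 or more corresponds to a vanishingly small p-value, while an overlap of at least 1 is unremarkable.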
MSigDB
• MSigDB is similar but takes a limited gene signature, and returns standard gene signatures ranked by FDR
What about a broader scope?
• We wanted something that can look at all known expression datasets without tedious/impossible manual curation
Proposed tool
• Input: differential gene expression (RNA-Chip, RNA-Seq, exome sequencing) -> gene selection -> gene list
• Reference: massive database (GEO) of expression experiments -> expression matrices -> clustering algorithm (each experiment independently clustered) -> static database of clustered expression matrices: clusters (aka modules)
• Search algorithm: find overlaps between the gene list and the clusters
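The search step in this diagram can be sketched as a loop over pre-computed modules that scores each overlap with a right-sided hypergeometric p-value. This is a hypothetical illustration, not the GeneQuery implementation: the module identifiers, the toy gene lists, the fixed universe size, and the function names are all assumptions.

```python
from math import comb

def overlap_p_value(universe, module_size, query_size, overlap):
    """Right-sided hypergeometric p-value for an overlap of this size or larger."""
    upper = min(module_size, query_size)
    tail = sum(comb(module_size, i) * comb(universe - module_size, query_size - i)
               for i in range(overlap, upper + 1))
    return tail / comb(universe, query_size)

def search_modules(query, modules, universe_size):
    """Rank modules by how unlikely their overlap with the query gene list is."""
    query = set(query)
    hits = []
    for module_id, genes in modules.items():
        overlap = query & set(genes)
        if not overlap:
            continue  # modules without any shared gene are not reported
        p = overlap_p_value(universe_size, len(set(genes)), len(query), len(overlap))
        hits.append((p, module_id, sorted(overlap)))
    hits.sort()  # smallest p-value first
    return hits

# Toy "static database": two modules from one hypothetical experiment.
toy_modules = {
    "experiment1_module1": ["Arg1", "Retnla", "Chil3", "Mrc1", "Il10"],
    "experiment1_module2": ["Nos2", "Tnf", "Il6", "Il1b", "Arg1"],
}
toy_query = ["Arg1", "Retnla", "Chil3", "Mrc1"]
toy_hits = search_modules(toy_query, toy_modules, universe_size=20_000)
```

Each hit carries the p-value, the module identifier, and the overlapping genes, so the best-matching experiments can be listed first.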
Test Set 2: Mouse samples
• Two separate databases were assembled (Homo sapiens and Mus musculus)
• Using GEO (Gene Expression Omnibus) statistics, the most used platforms were selected
• Search performed using the following criteria:
– expression profiling by array
– Mus musculus
– (See platforms below)
– 12:100 samples (12 to 100 samples per dataset)
– Years of publication: 2009-3000 (2009 onward)
• 2042 results returned, saved as a meta-file, downloaded
• 1529 preprocessed CSV files obtained after running the pre-processing script
• 1496 have successfully completed clustering and were usable in the database

Platform ID   Manufacturer   Type   Number of sets
GPL1261       Affymetrix     ISO    681
GPL6246       Affymetrix     ISO    317
GPL8321       Affymetrix     ISO    107
GPL339        Affymetrix     ISO    33
GPL81         Affymetrix     ISO    28
GPL7202       Agilent        ISO    60
GPL4134       Agilent        ISO    44
GPL6887       Illumina       OB     142
GPL6885       Illumina       OB     84
Test Set 2: Human samples
• Using GEO (Gene Expression Omnibus) statistics, the most used platforms were selected
• Search performed using the following criteria:
– expression profiling by array
– Homo sapiens
– (See platforms below)
– 12:100 samples (12 to 100 samples per dataset)
– Years of publication: 2009-3000 (2009 onward)
• 2962 results returned, saved as a meta-file, downloaded
• 2177 preprocessed CSV files obtained after running the pre-processing script
• 2110 have successfully completed clustering and were usable in the database

Platform ID   Manufacturer   Type   Number of sets
GPL570        Affymetrix     ISO    982
GPL6244       Affymetrix     ISO    291
GPL96         Affymetrix     ISO    170
GPL571        Affymetrix     ISO    125
GPL8300       Affymetrix     ISO    17
GPL4133       Agilent        ISO    83
GPL6480       Agilent        ISO    53
GPL10558      Illumina       OB     149
GPL6947       Illumina       OB     130
GPL6884       Illumina       OB     57
GPL6883       Illumina       OB     53
Eigengene Expression
[Figure: eigengene expression of the modules; step shown: normalize per-module expression]
Do we need to adjust for multiple comparisons?
• Yes.
Distributions
• The distributions are fairly close to normal, so we use the normal distribution to adjust the p-value.
Linear regressions for p-values
• Linear regressions were used to calculate adjusted p-values on the fly from gene sets of arbitrary size

Database   Average               Standard deviation
mm_2K      -0.02243*x-2.9722     0.001502*x+1.626456
mm_4K      -0.01322*x-3.06942    0.0007694*x+0.9825192
hs_2K      -0.02532*x-1.4273     0.0007609*x+0.9656458
hs_4K      -0.01870*x-2.2151     0.0009385*x+1.0276621
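One plausible reading of this table is sketched below. The interpretation is an assumption, not something the slides state: x is taken to be the size of the query gene set, the two regressions are taken to give the mean and standard deviation of the null distribution of log10 p-values, and the adjusted p-value is taken to be the lower tail of the corresponding normal distribution. Only the coefficients come from the table.

```python
import math

# (slope, intercept) pairs copied from the table above.
NULL_FITS = {
    "mm_2K": {"mean": (-0.02243, -2.9722), "sd": (0.001502, 1.626456)},
    "mm_4K": {"mean": (-0.01322, -3.06942), "sd": (0.0007694, 0.9825192)},
    "hs_2K": {"mean": (-0.02532, -1.4273), "sd": (0.0007609, 0.9656458)},
    "hs_4K": {"mean": (-0.01870, -2.2151), "sd": (0.0009385, 1.0276621)},
}

def null_parameters(query_size, database):
    """Mean and standard deviation of the assumed null log10 p-value distribution."""
    slope_m, intercept_m = NULL_FITS[database]["mean"]
    slope_s, intercept_s = NULL_FITS[database]["sd"]
    return (slope_m * query_size + intercept_m,
            slope_s * query_size + intercept_s)

def adjusted_p_value(log10_p, query_size, database="mm_2K"):
    """Lower-tail normal probability of seeing a log10 p-value this small by chance.
    ASSUMPTION: this is how the regressions are applied; the talk does not say."""
    mu, sigma = null_parameters(query_size, database)
    z = (log10_p - mu) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Example: a 100-gene query against the mm_2K database (hypothetical raw p-values).
mu_100, sigma_100 = null_parameters(100, "mm_2K")
adj_typical = adjusted_p_value(mu_100, 100, "mm_2K")
adj_strong = adjusted_p_value(-20.0, 100, "mm_2K")
```

Under this reading, a raw log10 p-value equal to the null mean is adjusted to 0.5, and only overlaps far below the null mean remain significant.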
Website features
• https://artyomovlab.wustl.edu/genequery/ is operational!
• Waiting time is 20 s to 1 min; tested with queries of up to 1.5k genes
• Human database (~5k experiments) and mouse database (~3.5k experiments) are available
• You can enter a gene list in the form of gene symbols, RefSeq IDs, or Entrez IDs
Output
Example 1: M2 (IL-4 activated) macrophage-specific genes in mice
Example 2: M1 (LPS activated) macrophage-specific genes in mice
GEO “Expression Universe”
• The human network is dominated by small clusters, the murine one by large tissue-specific clusters
Connectivity Map (Cmap)
• The Connectivity Map (Cmap) resource is available at https://www.broadinstitute.org/cmap/
• 1300 drugs were used to treat 3 human cell lines, resulting in perturbation of gene expression
• Allows one to connect a given gene signature to an appropriate drug
Cmap vs. GeneQuery
• The results were impressive: over 95% (1245 out of 1303) of the up-regulated and over 93% (1219 out of 1303) of the down-regulated drug signatures overlapped at least one module!
• Many matched expected phenotypes, while some matches were unexpected
• This shows good potential for drug repurposing
Digoxin
• Digoxin is a cardiac glycoside (strengthens heart muscle contraction)
• Used to treat heart conditions such as atrial fibrillation and heart failure
Digoxin and GeneQuery
• Distinct overlaps with many modules, implying interference with TLR4 but excluding NF-κB pathways
• Example: up-regulated upon infection in a cell line with a disabled OspF gene (necessary for NF-κB signaling)
Newsflash: digoxin as prospective ALS drug!
• Last month in the Wash U “Record” newspaper
• Thought to reduce cytokine release via Na/K ATPase inhibition
• Reported link between digoxin and Th17
Ciclopirox
• Topical antifungal
• Known iron chelator, mimics hypoxia via HIF-1α up-regulation
• Currently in clinical studies for anti-tumor activity
Ciclopirox and GeneQuery
• We see overlaps with
– hypoxia
– cancerous tumors
– inflammatory phenotypes
Testing the hypothesis
• Cultures of bone marrow-derived macrophages (BMDMs) were treated with ciclopirox
• We then compared the LPS response of treated and untreated BMDMs
• ELISA assays confirmed up-regulation of IL-1β and down-regulation of IL-6
As you can see, it’s all quite simple.
Thank you for your attention!