GeneQuery: A phenotype search tool based on gene co-expression clustering
Alexander Predeus
21-Oct-2015

About myself
• Graduated from Moscow State University (1998-2003)
• PhD: Michigan State University (2003-2009): organometallic chemistry
• Post-doc #1, MSU: quantitative biology (molecular dynamics)
• Post-doc #2, Wash U: next-generation sequencing, systems biology

Outline
• What is gene clustering?
• WGCNA: Mjölnir of clustering
• What is gene set enrichment analysis?
• How can we find experiments biologically similar to ours?
• GEO database universe
• Cmap perturbagen database and how it's useful to us

Clustering
• The goal of any clustering is to group objects by similarity
• Thus clustering reveals the inner structure of the data
• Two questions arise:
  – how do you measure similarity?
  – how do you use that measure to find the groups?

Gene clustering
• For gene expression you can cluster individual samples (columns)
• Samples from the same conditions are expected to cluster together (unless there is a batch effect!)
• You can also cluster genes (rows)
• Genes that are regulated in the same pathway tend to be co-expressed

Ways to measure distance & cluster
• Metrics can vary depending on the goal
• Euclidean distance or correlation are commonly used
• Euclidean distance is sensitive to scaling and average expression; correlation is not
[Figure panels: a. groups; b. hierarchical; c. k-means; d. SOM]

Real-life datasets
• … are hard to cluster, because they are messy
• outlier samples and genes are most often the problem
• cluster shape is also important

Mjölnir
• Mjölnir is Thor's hammer, which cannot be lifted by anyone who is not a god

An astronomer's perspective
• it's not necessarily magical, it could just be very heavy

WGCNA algorithm 1
• Stands for Weighted Correlation Network Analysis
• Uses a type of Pearson correlation-based metric
• As the name suggests, it is tightly related to the network analysis paradigm, more concretely to the concept of a scale-free network
• If we have m samples, $x_i$ is the vector of expression values of length m for gene i, and we are comparing genes i and j:
  – similarity: $s_{ij} = |\mathrm{cor}(x_i, x_j)|$
  – unweighted adjacency: $a_{ij} = 1$ if $s_{ij} \ge \tau$, and $a_{ij} = 0$ otherwise
• If τ (the cutoff correlation) is set to 0.8, no genes with a correlation of 0.799 or below would be considered adjacent

WGCNA algorithm 2
• Microarrays are noisy!
• Weighted adjacency: $a_{ij} = s_{ij}^{\beta}$, where β is > 1, usually 6-20 in real applications
• This approach is called soft thresholding
• Gene significance can range from 0 to 1 and is defined via a clinical trait, which can be quantitative (body weight) or qualitative (treatment vs control)
• By calculating all $a_{ij}$ we have constructed an n × n adjacency matrix, where n is the number of genes

Soft thresholding illustrated
• As the power β increases, the adjacency of lowly correlated genes becomes negligible

What does it have to do with a network?
• if genes are adjacent, that could be represented as a connection
• if not, then there is no connection
[Figure: adjacency matrix and the corresponding network]

Why bring the network into this at all?
• A scale-free network is defined as one for which the probability of a node having k connections decays as a power law: $p(k) \sim k^{-\gamma}$
• Scale-free topology is a philosophical phenomenon

WGCNA algorithm 3
• So we want our network scale-free; how do we achieve it?
• First, we calculate connectivities: $k_i = \sum_{j \ne i} a_{ij}$
• Then we simply change β in the range from 1 to 20, calculate p(k) from the gene connectivities for each β, and see how linear the log(p(k)) vs. log(k) plot is (as measured by R-squared)
• We want the fit to be very close to linear, because a scale-free network has $p(k) \sim k^{-\gamma}$

WGCNA algorithm 4
• So, we chose the β that gives us an R-squared of at least 0.8, i.e. we constructed a network. What now?
• We identify modules using TOM, the topological overlap measure
• When two genes (nodes) connect to the same large group of other nodes, they have high topological overlap

WGCNA algorithm 5
• TOM can thus be represented as a matrix with the same dimensionality as the adjacency matrix: n × n
• The TOM-based dissimilarity measure will thus be $d_{ij} = 1 - \mathrm{TOM}_{ij}$
• Using this dissimilarity measure as a metric, we perform hierarchical clustering

Final touch: Dynamic and hybrid dynamic tree cut
• Notice the presence of a "null module", which de facto holds the genes rejected from all other modules

Eigengenes
• The eigengene is the first principal component of the module
• Eigengene expression can be used as a measure of module expression change across the samples

Outline
• What is gene clustering?
• WGCNA: Mjölnir of clustering
• What is gene set enrichment analysis?
• How can we find experiments biologically similar to ours?
• GEO database universe
• Cmap perturbagen database and how it's useful to us
• The wonderful story of ciclopirox

Hypergeometric probability and gene sets
• protein-coding gene repertoire of ~20k genes
• signaling pathway X containing 100 genes
• DE: 200 genes are up-regulated
• How many genes from pathway X would be included in our 200 simply by chance?
• What is the p-value of having 50 or more?

Use
• mathematically identical to the drawing-without-replacement model
• The exact solution is known as Fisher's exact test
• We want the right-sided p-value

MsigDB
• MsigDB is similar but takes a limited gene signature, and returns standard gene signatures ranked by FDR

What about broader scope?
• We wanted something that can look at all known expression datasets without tedious/impossible manual curation

Proposed tool
• Input: differential gene expression (RNA-chip, RNA-Seq, exome sequencing) → gene selection → gene list
• Reference: massive database (GEO) of expression experiments, each independently clustered → clustering algorithm → static database of clustered expression matrices (clusters, aka modules)
• Search algorithm: find overlaps between the gene list and the clusters

Test Set 2: Mouse samples
• Two separate databases were assembled (Homo sapiens and Mus musculus)
• Using Gene Expression Omnibus (GEO) statistics, the most used platforms were selected
• Search performed using the following criteria:
  – expression profiling by array
  – Mus musculus
  – (see platforms below)
  – 12:100 samples
  – Years of publication: 2009-3000
• 2042 results returned, saved as a meta-file, downloaded
• 1529 preprocessed CSV files obtained after running the pre-processing script
• 1496 successfully completed clustering and were usable in the database

Platform ID   Manufacturer   Type   Number of sets
GPL1261       Affymetrix     ISO    681
GPL6246       Affymetrix     ISO    317
GPL8321       Affymetrix     ISO    107
GPL339        Affymetrix     ISO    33
GPL81         Affymetrix     ISO    28
GPL7202       Agilent        ISO    60
GPL4134       Agilent        ISO    44
GPL6887       Illumina       OB     142
GPL6885       Illumina       OB     84

Test Set 2: Human samples
• Using Gene Expression Omnibus (GEO) statistics, the most used platforms were selected
• Search performed using the following criteria:
  – expression profiling by array
  – Homo sapiens
  – (see platforms below)
  – 12:100 samples
  – Years of publication: 2009-3000
• 2962 results returned, saved as a meta-file, downloaded
• 2177 preprocessed CSV files obtained after running the pre-processing script
• 2110 successfully completed clustering and were usable in the database

Platform ID   Manufacturer   Type   Number of sets
GPL570        Affymetrix     ISO    982
GPL6244       Affymetrix     ISO    291
GPL96         Affymetrix     ISO    170
GPL571        Affymetrix     ISO    125
GPL8300       Affymetrix     ISO    17
GPL4133       Agilent        ISO    83
GPL6480       Agilent        ISO    53
GPL10558      Illumina       OB     149
GPL6947       Illumina       OB     130
GPL6884       Illumina       OB     57
GPL6883       Illumina       OB     53

[Figure: eigengene expression; modules; normalize per-module expression]

Do we need to adjust for multiple comparisons?
• Yes.

Distributions
• distributions are fairly close to normal, so we use the normal distribution to adjust the p-value

Linear regressions for p-values
• Linear regressions were used to calculate adjusted p-values on the fly from gene sets of arbitrary size

Database   Average                Standard deviation
mm_2K      -0.02243*x-2.9722      0.001502*x+1.626456
mm_4K      -0.01322*x-3.06942     0.0007694*x+0.9825192
hs_2K      -0.02532*x-1.4273      0.0007609*x+0.9656458
hs_4K      -0.01870*x-2.2151      0.0009385*x+1.0276621

Website features
• https://artyomovlab.wustl.edu/genequery/ is operational!
• waiting time is 20 s to 1 min, tested with queries of up to 1.5k genes
• human database (~5k experiments) and mouse database (~3.5k experiments) are available
• you can enter a gene list in the form of gene symbols, RefSeq IDs, or Entrez IDs

Output
• Example 1: M2 (IL-4 activated) macrophage-specific genes in mice
• Example 2: M1 (LPS activated) macrophage-specific genes in mice

GEO "Expression Universe"
• The human network is dominated by small clusters, the murine one by large tissue-specific clusters

Connectivity Map (Cmap)
• The Connectivity Map (Cmap) resource is available at https://www.broadinstitute.org/cmap/
• 1300 drugs were used to treat 3 human cell lines, resulting in perturbation of gene expression
• Allows connecting a given gene signature to an appropriate drug

Cmap vs. GeneQuery
• The results were impressive: over 95% (1245 out of 1303) up-regulated and 93% (1219 out of 1303) drugs overlapped with at least one module!
• Many matched expected phenotypes, while some matches were unexpected
• This shows good potential for drug repurposing

Digoxin
• Digoxin is a cardiac glycoside (causes heart muscle contraction)
• Used to treat heart conditions, like atrial fibrillation and heart failure

Digoxin and GeneQuery
• Distinct overlaps with many modules, implying interference with TLR4 but excluding NF-κB pathways
• Example: up-regulated upon infection in a cell line with a disabled OspF gene (necessary for NF-κB signaling)

Newsflash: digoxin as prospective ALS drug!
• Last month in the Wash U "Record" newspaper
• Thought to reduce cytokine release via Na/K ATPase inhibition
• Reported link between digoxin and Th17

Ciclopirox
• Topical antifungal
• Known iron chelator, mimics hypoxia via HIF-1a up-regulation
• Currently in clinical studies for anti-tumor activity

Ciclopirox and GeneQuery
• We see overlaps with
  – hypoxia
  – cancerous tumors
  – inflammatory phenotypes

Testing the hypothesis
• Cultures of bone-marrow derived macrophages (BMDMs) were treated with ciclopirox
• We then compared the LPS response of treated and untreated BMDMs
• ELISA assays confirmed up-regulation of IL-1b and down-regulation of IL-6

As you can see, it's all quite simple.
Thank you for your attention!
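
Appendix: the hypergeometric question from the gene set slides (about 20k genes, a pathway of 100 genes, 200 up-regulated genes, 50 or more in the overlap) can be worked through with a short sketch. This is only an illustration under stated assumptions, not the GeneQuery implementation: the function name `hypergeom_right_tail` and its argument names are hypothetical, and the right-sided tail of the hypergeometric distribution is used as the one-sided Fisher's exact test p-value.

```python
from math import comb


def hypergeom_right_tail(total_genes, pathway_size, n_drawn, min_overlap):
    """P(overlap >= min_overlap) when n_drawn genes are drawn without
    replacement from total_genes, of which pathway_size are in the pathway.

    Hypothetical helper for illustration; exact integer arithmetic is used
    so that very small tail probabilities are not lost to rounding.
    """
    upper = min(pathway_size, n_drawn)
    favourable = sum(
        comb(pathway_size, i) * comb(total_genes - pathway_size, n_drawn - i)
        for i in range(min_overlap, upper + 1)
    )
    return favourable / comb(total_genes, n_drawn)


# Numbers taken from the slide: ~20k genes, pathway X with 100 genes,
# 200 up-regulated genes.
N, K, n = 20_000, 100, 200

# Expected overlap by chance: n * K / N
expected_overlap = n * K / N

# Right-sided p-value of seeing 50 or more pathway genes among the 200
p_value = hypergeom_right_tail(N, K, n, 50)

print(f"expected overlap by chance: {expected_overlap}")
print(f"P(overlap >= 50) = {p_value:.3e}")
```

With these numbers only about one pathway gene is expected in the list by chance, so an overlap of 50 is overwhelmingly significant. In practice a statistics library would normally be used instead of summing binomial coefficients by hand.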