Downstream analysis of transcriptomic data
Shamith Samarajiwa
CRUK Bioinformatics Summer School, July 2015

General Methods
•  Dimensionality reduction methods (clustering, PCA, MDS)
•  Visualizing patterns (heatmaps, dendrograms)
•  Over-representation analysis (ORA)
   Ontology enrichment
   Gene set enrichment analysis (GSEA/GSA)
   Pathway enrichment analysis
•  Network biology methods
•  Promoter analysis of co-regulated genes
•  Gene/protein interaction analysis
•  Biomedical bibliomics

The curse of dimensionality
•  Curse of dimensionality (Bellman 1961): phenomena that arise when analyzing and organizing data in high-dimensional spaces and that do not occur in low-dimensional settings.
•  Caused by number of features > number of samples
•  High-dimensional (d): hundreds or thousands of dimensions.
•  Gene expression: d ~ 10^4 - 10^5
•  SNP data: d ~ 10^6
A common theme of these situations is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance.

Dimensionality Reduction
•  Dimensionality reduction techniques are a common approach for dealing with noisy, high-dimensional data.
•  These unsupervised methods can help uncover interesting structure in complex datasets.
•  However, they give very little insight into the biological and technical aspects that might explain the uncovered structure.

Principal Component Analysis (PCA)
•  Principal component analysis (PCA) reduces the dimensionality of the data while retaining most of the variance in the data set.
•  It accomplishes this reduction by identifying directions, called principal components, along which the variance in the data is maximal. These directions are the eigenvectors of the covariance matrix, and each eigenvalue gives the variance along its component. By using a few components, each sample can be represented by relatively few numbers instead of by values for thousands of variables.
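As a rough illustration of the idea of a direction of maximal variance, the sketch below finds only the leading principal component by power iteration on the covariance matrix. This is a swapped-in teaching technique using the Python standard library; in practice a library routine (for example an SVD-based PCA) would normally be used. The function names and the toy expression values are hypothetical and not from the slides.

```python
import math
import random


def center_columns(rows):
    """Subtract each column (feature) mean so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def first_principal_component(rows, iterations=100, seed=0):
    """Return (direction, variance, scores) for the leading component.

    rows is a samples x features table (e.g. samples x genes).
    direction is a unit vector, variance is the variance of the data
    along it, and scores holds one number per sample.
    """
    x = center_columns(rows)
    n, d = len(x), len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random starting direction
    for _ in range(iterations):
        # Apply the covariance matrix without forming it: w = X^T (X v) / (n - 1)
        scores = [sum(row[j] * v[j] for j in range(d)) for row in x]
        w = [sum(scores[i] * x[i][j] for i in range(n)) / (n - 1) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("data have no variance along the current direction")
        v = [c / norm for c in w]
    scores = [sum(row[j] * v[j] for j in range(d)) for row in x]
    variance = sum(s * s for s in scores) / (n - 1)
    return v, variance, scores


# Toy data (made up): 4 samples x 3 "genes"; the first two genes vary together.
expression = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0], [4.0, 4.0, 0.0]]
direction, variance, scores = first_principal_component(expression)
print("direction:", direction)
print("variance along it:", variance)
print("one score per sample:", scores)
```

Each sample is reduced from three values to a single score here. The sign of the direction is arbitrary, so the scores may come out mirrored.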
Data can then be visualized, making it possible to assess similarities and differences between samples and determine whether samples are grouped together.

Multi-dimensional Scaling (MDS)
•  Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix.

Clustering
•  The process of grouping together similar entities. Need to define similarity: distance metric
•  Hierarchical clustering
•  K-means
•  SOFM (self-organizing feature maps)
•  PAM (partitioning around medoids)
•  Biclustering

Clustering
•  Given enough genes, a set of genes will always cluster.
•  The clusters produced by a given algorithm are highly dependent on the distance metric used.
•  The same clustering method applied to the same data may produce different results.

Visualizing complex datasets with heatmaps
•  There are multiple R packages and functions that can generate heatmaps.
   i.  Heatplus
   ii.  pheatmap
   iii.  heatmap.2 {gplots}
   iv.  heatmap.plus

Correlation Heatmap

Over-Representation Analysis (ORA)
•  Given a list of genes/features and one or more lists of annotations, are any of the annotations surprisingly enriched in the gene list?
•  How to assess "surprisingly"? Statistics
•  How to correct for repeated testing? False Discovery Rate correction
•  Test for under- and/or over-enrichment relative to a background population; selecting the right "background" is important!
   i.  Hypergeometric test
   ii.  Fisher's exact test
   iii.  Two-tailed t-test (significant difference between the means of two distributions)
   iv.  Kolmogorov-Smirnov test (a test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution)

Fisher's Exact Test
•  Fisher's exact test is used for ORA of gene lists for a single type of annotation.
•  The p-value for Fisher's exact test
   - is "the probability that a random draw of the same size as the gene list from the background population would produce the observed number of annotations in the gene list or more",
   - and depends on the sizes of both the gene list and the background population, as well as on the number of genes carrying the annotation in the gene list and in the background.

Ontologies and Ontology Enrichment
•  Gene Ontology: a controlled, machine-readable vocabulary organized as a directed acyclic graph (DAG) with multiple levels of classification; it can be used across species
•  3 types of gene ontology: Biological Process (BP), Molecular Function (MF), Cellular Component (CC)
•  9 levels (generic to extremely specific)
•  "is a" and "part of" relationships
•  Evidence codes
•  Allows for incomplete knowledge
•  More than 80 types of biomedical ontologies.
•  Enrichment tools: more than 400
•  Examples: NIH DAVID, GREAT, GOseq

Pathway Analysis
•  Identify pathways that are perturbed or significantly impacted in a given condition or phenotype
•  Many pathway databases (~550). Incomplete pathway descriptions.
•  Provenance of annotation.
•  Signalling, metabolic, regulatory pathways

Gene Set Enrichment Analysis (GSEA)
•  A method for assessing whether a predefined set of genes shows statistically significant differences between two conditions.

MSigDB & GMT files
•  A collection of gene sets
•  No centralized curation: user-submitted gene sets
•  GSEA uses the list rank information without using a threshold.
•  The 10348 gene sets in the Molecular Signatures Database (MSigDB) are divided into 8 major collections and several sub-collections.
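To make "uses the list rank information without using a threshold" more concrete, here is a minimal sketch of an unweighted, Kolmogorov-Smirnov-like running-sum score over a ranked gene list, together with a small reader for GMT-style lines (tab-separated: set name, description, then gene symbols). It is a simplification: the published GSEA method weights hits by their correlation with the phenotype and assesses significance by permutation, neither of which is shown. The function names, gene identifiers and set names are hypothetical.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {set_name: [gene, ...]}.

    Each GMT line is tab-separated: set name, description, gene symbols.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene_sets[fields[0]] = [gene for gene in fields[2:] if gene]
    return gene_sets


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score of gene_set along ranked_genes.

    Walk down the ranked list, stepping up at genes in the set and down
    at genes outside it; return the maximum deviation from zero.
    Positive values mean the set sits near the top of the ranking.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Made-up example: six genes ranked from most up- to most down-regulated.
ranked = ["gA", "gB", "gC", "gD", "gE", "gF"]
gmt_lines = [
    "TOY_SET_TOP\tmade-up description\tgA\tgB\n",
    "TOY_SET_BOTTOM\tmade-up description\tgE\tgF\n",
]
for name, genes in parse_gmt(gmt_lines).items():
    print(name, enrichment_score(ranked, genes))
```

No differential-expression cut-off appears anywhere in the score: every gene in the ranking contributes a step, which is the contrast with ORA.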
Gene Set Analysis (GSA)
•  It differs from Gene Set Enrichment Analysis (Subramanian et al. 2005) in its use of the "maxmean" statistic: the mean of the positive part or of the negative part of the gene scores in the gene set, whichever is larger in absolute value.

Promoter Enrichment Analysis
•  Identify a co-regulated gene set or signature. Extremely hard to do!
•  Co-regulated genes may have a similar biological function or be involved in the same biological process, show differential expression, be involved in the same pathway, or belong to the same cluster?
•  Identify the upstream regulators of that gene set. How?
•  Locate the promoters / regulatory elements of the gene set.
•  Then search for transcription factor binding sites (TFBS) that are enriched in that promoter set (compared with a random promoter set).

PSCAN
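The last step, asking whether a binding-site motif is enriched in a promoter set relative to a background, can be sketched with the same one-sided test described earlier for ORA: the probability of seeing the observed number of hits or more in a random draw of the same size. The sketch below computes that hypergeometric upper tail and applies a Benjamini-Hochberg correction across motifs. It assumes the promoter set is a subset of the background, it is not a description of how PSCAN works internally, and the motif names and counts are invented for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn from N, of which K are annotated.

    Here: k promoters with the motif in a set of n promoters, against
    K promoters with the motif among N background promoters.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented counts: (promoters with motif in set, promoters with motif in background)
motif_hits = {"MOTIF_A": (8, 40), "MOTIF_B": (2, 300), "MOTIF_C": (1, 15)}
n_set, n_background = 20, 2000  # set is assumed to be drawn from the background

names = list(motif_hits)
pvals = [hypergeom_upper_tail(motif_hits[m][0], motif_hits[m][1], n_set, n_background)
         for m in names]
for name, p, q in zip(names, pvals, benjamini_hochberg(pvals)):
    print(f"{name}: p = {p:.3g}, BH-adjusted = {q:.3g}")
```

The choice of background matters here as much as it does for gene-list ORA: changing `n_background` or the background hit counts changes every p-value.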