* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
History of genetic engineering wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Heritability of IQ wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Pathogenomics wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Essential gene wikipedia , lookup
Quantitative trait locus wikipedia , lookup
Microevolution wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Public health genomics wikipedia , lookup
Metagenomics wikipedia , lookup
Genome evolution wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Designer baby wikipedia , lookup
Gene expression programming wikipedia , lookup
Genome (book) wikipedia , lookup
Genomic imprinting wikipedia , lookup
Minimal genome wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Ridge (biology) wikipedia , lookup
Microarray data analysis with Chipster 22.9.2008 Jarno Tuimala Program – an analysis workflow Basic functionality of Chipster Data import Quality control Normalization • Describing the experiment Filtering and missing value considerations Statistical testing Clustering and visualization Annotation Introduction to Chipster Chipster Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project Features • Easy to use graphical user interface • Comprehensive selection of tools • Support for different array types (Affymetrix, Agilent, Illumina, cDNA) • Compatible with Windows, Linux and Mac OS X • Easy to install and update • Wizards and workflows • Interactive graphics • Transparency (as opposed to “black box”) • Alternative annotations for Affymetrix arrays • Automatic tracking of performed analyses http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf http://chipster.csc.fi How does it work? CSC internet front end security desktop SSL client SOAP Java Web Start installs and updates client automatically analyser Corona/Murska ANALYSIS international Web Services VISUALISATION Phenodata – describing your experiment Phenodata file is created during normalization Fill in the group column with numbers describing your experimental setup • • e.g. 1 = healthy control, 2 = cancer sample necessary for the statistical tests to work If you bring in previously created normalized data and phenodata: • • Choose ”import directly” in the import tool Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation” If you brought in normalized data and need to create phenodata for it: • • • Utilities/ Generate phenodata (fill in the chiptype parameter!) Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation” Fill in the group column Visualizing the data Data visualization panel • Maximize and redraw for better viewing Two types of visualizations 1. Interactive visualizations produced by the client program • Select the visualization method from the pulldown menu of the data visualization panel • Save by right clicking on the image 2. Static images produced by R/Bioconductor, Weeder, etc • Select from Analysis tools/ Visualisation • View by double clicking on the image file • Save by right clicking on the file name and choosing ”Export” Interactive visualizations by the client Spreadsheet Histogram Scatterplot 3D scatterplot Expression profiles Clustered profiles Hierarchical clustering SOM clustering Array pseudo-image Venn diagram Available actions: Change titles, colors etc Zoom in/out Static images produced by R/Bioconductor Volcano plot Box plot Histogram Heatmap Venn diagram Idiogram Chromosomal position Correlogram Dendrogram QC stats plot RNA degradation plot K-means clustering SOM-clustering Automatic tracking of analysis history Running many analyses simultaneously You can have max 5 analysis jobs running at the same time Use Task manager to • • view parameters, status,… cancel jobs Workspace – continue later/elsewhere Saving your workspace allows you to continue later • • File/ Save workspace File/ Load workspace Currently it is possible to have only one workspace saved at the time If you would like to continue your work on another computer, you need to transfer the workspace-snapshot -folder to the corresponding location • C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot Importing files Affymetrix CEL-files are imported to Chipster automatically Other files are imported using the Import tool Import tool, step 1 Define • • • • Header Footer Title row Delimiter Import tool, step 2 Define columns Modify flags Importing Agilent files (required fields) Sample (rMeanSignal) Sample background (rBGMedianSignal) Control (gMeanSignal) Control background (gBGMedianSignal) Identifier (ProbeName) Annotation (ControlType) Flag (IsManualFlag) https://extras.csc.fi/biosciences/chipster-manual/data-formats.html Quality control Quality control tools Quality control -tools • • Affymetrix basic RNA degradation + Affy QC Agilent MA-plot + density plot + boxplot Visualization – dendrogram Statistics - NMDS Affymetrix I Quality control tools are run on raw data (CEL files). • Dendrogram and NMDS on normalized data Agilent General QC – dendrogram and NMDS Scatterplots Heatmaps (this took an hour to calculate) QC-tools in Chipster Quality control • • • Affymetrix basic Affymetrix RLE and NUSE Agilent 2-color Visualization • • • Dendrogram Heatmap Correlogram Statistics • NMDS Normalization What is normalization? Normalization is the process of removing systematic variation from the data. Typically you would normalize your data so that all the chips become comparable. Methods Affymetrix • • Background correction + expression estimation + summarization RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing Agilent • Background correction + averaging duplicate spots + normalization After normalization the expression values are always expressed on log2-scale Affymetrix Methods: MAS5, Plier, RMA, GCRMA, Li-Wong • • • • MAS5 is the older Affymetrix method, Plier is a newer one RMA is the default, and works rather nicely if you have more than a few chips GCRMA is similar to RMA, but takes also GC% content into account Li-Wong is the method implemented in dChip Variance stabilization makes the variance over all the chips similar • Works only with MAS5 and Plier, since all others output log2tranformed data by default (and thus corrected for the same phenomenon) Custom chiptype • If you want to use reannotated probes (they are really assigned to the genes where they belong), select one from this menu. Agilent I Background correction • • Background treatment None, Subtract, Edwards, Normexp Background offset 0 or 50 Normalize chips • None, median, loess Normalize genes (not typically used) • None, scale (to median), quantile Chiptype • A must setting! Agilent II Background treatment typically generates many negative values that are coded as missing values after log2-transformation. • • Usual subtract option does this Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported) Loess removes curvature from the data (suggested) Checking normalization Filtering Gene filtering Removing probes for genes that are • • Not expressed Expressed at constant level (not changing) Often a good idea, and necessary before multiple testing correction can be adequately applied • Some controversy on this… Non-specific filtering • Expression, flags, SD, … Specific filtering • Statistical testing Non-specific filtering Often used for removing bad quality data: • • • Intensity value too low Intensity value saturated Appearance of the spot is abnormal Typically, non-changing genes are also removed These can be removed using • • • Filter by standard deviation Filter by interquartile range Filter by expression Specific filtering Selecting genes that are associated with some phenotype Typically involves statistical testing Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value. • • Both tell a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect. Take both into account by combining the filters. • Filter on expression value (what is biologically significant) and test for differences (what is statistically significant) Unspecific filtering in Chipster Pre-processing • • • • Filter by expression • Select the upper and lower cut-offs • Select the number of chips this rule has to fulfilled on • Select whether to return genes inside or outside the range Filter by SD • Select the percentage of genes to filter out Filter by interquartile range (IQR) • Select the IQR Filter by coefficient of variation (CV) • Median is used for filtering on CV (cannot be changed) Utilities 1. Calculate descriptive statistics 2. Filter using a column Venn diagram Select three datasets in Chipster Run the Venn diagram tool from Visualization tool category SD CV IQR Statistics Some terminology Usually tests for comparing means of two or more groups are used • Variance might be of interest too, but in practise this is never done. Parametric tests (assume data normally distributed) • Typically used for microarray data Non-parametric tests (assume no normality) P-value • • • Risk of saying that there is a difference when there really isn’t Traditionally 0.05 is used as a cut-off for significance False discovery range is a p-value corrected for multiple tests (more on this later) Mean and variance, an example for 1 gene density.default(x = y1) 0.3 0.0 0.1 0.2 Density 0.2 0.1 0.0 Density 0.3 0.4 0.4 density.default(x = x1) -6 -4 -2 0 2 N = 100000 Bandwidth = 0.08956 4 6 -10 -5 0 N = 100000 Bandwidth = 0.09006 5 10 Statistical testing Needs replication (>2 chips per group) • Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation. Comparing means (parametric tests) • • • • One-group tests • Compare to a known mean • Example: One-sample t-test Two-group tests • Compare two groups’ means • Example: Two-sample t-test Several group tests • Compare several groups’ means • Example: Analysis of variance (ANOVA) Two or more groups, two or more factors • Compare means in the groups according to both factor simultaneously • Example: multiple linear regression (linear modeling in Chipster) t-test Compares means of two groups • • • If the p-value is small that means that there is a difference between the groups. If the p-value is large (>0.05), there is no difference between the groups. p-value is a risk of saying that there is a difference when there actually isn’t. A test for every gene is run separately -> thousands of tests and p-values x1 x2 t SE ANOVA A generalization of t-test. Compares means of several groups. Tells whether the means are different, but not which means differ from each other. • For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster) A test for every gene is run separately -> thousands of tests and p-values Multiple testing correction I After getting the results for all the genes, p-values are adjusted for the number of tests conducted. When making several comparisons using the same test, some of the results will be chance findings. • Example: if p threshold is 0.05, every 20th significant result might be due to chance alone. If there were 10000 genes that were tested, 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (500) would be false positives, and only a minority are true positives (50). This can be corrected for (to some extent) by using a multiple testing correction. • • Benjamini and Hochberg FDR: If FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives. Thus, FDR can be much higher than p-value, and the results can still be meaningful and worth investigating. Multiple testing correction II The ranking of the genes does not change after multiple testing correction! • • If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction. If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate. Gene set test (”global test”) A typical result of an microarray experiment is a list of differentially expressed genes. Biologically, grouping these genes in pathways or functional categories would be more interesting. Are pathways associated with our endpoints of interest? • Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls? Works on the expression values data. Gene enrichment analysis A typical result of an microarray experiment is a list of differentially expressed genes. Biologically, grouping these genes in pathways or functional categories would be more interesting. Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories. Works on the gene list. Statistical tests in Chipster Statistics • • • • One sample tests • Are the genes expressed at all (different from 0)? Two group tests Several group tests Linear modeling Visualization • Volcano plot Clustering Clustering methods Hierarchical clustering Non-hierarchical clustering • • • K-means QT-clustering Self-organizing maps Classification / class prediction • K-nearest neighbor (KNN) Hierachical clustering Two phases: • • Pick a distance measure • Euclidean distance • Standard / Pearson correlation Pick the dendrogram drawing method • Average linkage Average linkage example Hierarchical clustering - heatmap Annotation Annotation Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc. Retrieved from multiple biological databases and stored as a single database in Chipster (generated by Bioconductor project). Required by certain analysis tools (annotation, GO enrichment, promoter analysis, chromosomal plots) • These tools don’t work for those chiptypes which don’t have Bioconductor annotation packages Alternative CDF environments for Affy CDF is a file that links individual probes to their location in genes (probesets) Affymetrix default annotation use old CDF files that map a sizable number of probes to wrong genes Alternative CDFs fix this problem In Chipster • • selecting ”custom chiptype” in Affymetrix normalization takes altCDFs to use Note: if you have normalized using a custom chiptype, certain tools requiring annotation won’t work (GO term enrichment, promotor analysis, annotation) Dai et al, (2005) Nuc Acids Res, 33(20):e175 http://brainarray.mbni.med.umich.edu/Brainarray/Database/Custom CDF/genomic_curated_CDF.asp