10/09/2015

Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week 7: Microarray data analysis

Peng Xiao, Ph.D.
Assistant Professor (Guda lab)
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center

Fall 2015 GCBA 815

Microarray
• A high-throughput technology for investigating genome-wide gene expression simultaneously
• A “snapshot” of the gene expression profile of a cell type, tissue, or organism

Microarray technology
• GeneChips
  • Contain short (25-mer) oligonucleotides as probes
  • About 400,000 different DNA spots per chip
  • Space for many controls and multiple spots (16-20) per gene
  • Mismatched oligos for background correction
• Spotted arrays
  • Concentrated DNA is prepared and spotted on glass slides
  • Much cheaper than GeneChips
  • Designed to contain only the genes of interest
  • Usually limited to about 15,000 spots per slide
  • Limits controls and duplicates for each gene

Two-channel vs one-channel detection
• Two-channel (or two-color)
  • Typically used to compare expression in two different treatments
  • Two types of cDNA labeling dyes
    • Cy3 has emission at 570 nm (corresponds to green)
    • Cy5 has emission at 670 nm (corresponds to red)
  • The two Cy-labeled cDNA samples are mixed and hybridized to a microarray
  • Relative intensities are used to detect up- or down-regulated genes
• Single-channel (one-color)
  • A single-color dye, so no dye bias from two colors
  • One array for one sample
  • One poor-quality sample does not affect the results of the other arrays (a problem in two-channel arrays)
  • Results are easily comparable to arrays from different experiments
  • Requires twice as many microarrays as two-channel arrays

Experimental design
• Controls (RNA spike-ins)
  • Negative control for background calculations
  • Normalization of Cy3 and Cy5 signals
• Technical replicates
  • Multiple aliquots of the same sample are run separately to account for technical variation
• Biological replicates
  • Multiple independent samples of the same kind are run separately to account for sample-to-sample variation
• Multiple spots for each gene on a single array to measure statistical significance

Affymetrix File Types
• DAT file
  • Raw (TIFF) optical image of the hybridized chip
• EXP file
  • Text file with experimental details
• CEL file
  • Processed DAT file (with intensity/position values)
• CDF file
  • Describes the layout of the chip (provided by Affymetrix)
• CHP file
  • Results created from the CEL and CDF files
• TXT file
  • CHP file in text format
• RPT file
  • Report file with QC information

Basic statistics
• Raw intensities
  • Pre-processing, background correction
  • Transformed into expression values (called normalization)
  • Use algorithms such as MAS5, RMA/GCRMA, etc.
• Fold change
  • Ratio of normalized intensities of experimental/control
• Log(intensity) or log(ratio)
  • Brings the differential expression values from a multiplicative scale to an additive scale
• Statistical test (Student’s t-test, F-test/one-way ANOVA)
• P-value (obtained from a statistical test)

Log transformations

Normalized intensity    Natural log (ln)    Log2     Log10
1                       0                   0        0
10                      2.30                3.32     1
20                      3.00                4.32     1.30
100                     4.61                6.64     2
1000                    6.91                9.97     3
10000                   9.21                13.29    4
50000                   10.82               15.61    4.70
100000                  11.51               16.61    5

Public Repositories of Microarray Datasets
• NCBI’s GEO (Gene Expression Omnibus)
  • Original expression datasets
  • Curated gene expression datasets
  • Cluster tools and differential expression queries
• EBI’s ArrayExpress

Microarray Data Analysis Using BRB-ArrayTools
• BRB-ArrayTools
  • http://linus.nci.nih.gov/download_BRBArrayTools.html
• R software
  • https://www.r-project.org/
• Bioconductor packages in R
  • https://www.bioconductor.org/install/

BRB-ArrayTools
An Integrated Software Tool for DNA Microarray Analysis
• Developed under the direction of Dr. Richard Simon of the Biometrics Research Branch, NCI.
• The software was developed to put powerful statistical tools in the hands of biologists.
• Analyses are launched from a user-friendly Excel interface. It also requires installation of R, a free software package, to run the back-end programs.
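As a worked illustration of the Basic statistics and Log transformations slides, the following minimal Python sketch computes a log2 fold change and one row of the log table. The function names and the intensity values are hypothetical and chosen only for illustration; they are not part of BRB-ArrayTools or any other package named in these slides.

```python
import math

def log2_fold_change(experimental, control):
    """Return the log2 ratio of experimental to control normalized intensity."""
    if experimental <= 0 or control <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    return math.log2(experimental / control)

def log_table_row(intensity):
    """Return (ln, log2, log10) of an intensity, each rounded to 2 decimals."""
    return (
        round(math.log(intensity), 2),
        round(math.log2(intensity), 2),
        round(math.log10(intensity), 2),
    )

# Hypothetical normalized intensities for a single gene.
control = 100.0
two_fold_up = log2_fold_change(200.0, control)    # ratio 2.0 -> +1 on the log2 scale
two_fold_down = log2_fold_change(50.0, control)   # ratio 0.5 -> -1 on the log2 scale

# On the ratio scale the two changes look asymmetric (2.0 vs 0.5);
# on the log2 scale they are symmetric around 0.
print(two_fold_up, two_fold_down)
print(log_table_row(100))
```

Because log(a/b) = log(a) - log(b), a ratio on the multiplicative scale becomes a difference on the additive scale, which is why up- and down-regulation of the same magnitude end up equally far from zero.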
Data input to BRB-ArrayTools
• Inputs
  • Expression data (one or more files): Excel workbook containing a single worksheet (or simply a text file)
  • Gene identifiers (may be in a separate file): Excel workbook containing a single worksheet (or simply a text file)
  • Experiment descriptors: Excel workbook containing a single worksheet (or simply a text file)
  • User-defined gene lists: one or more text files
• Workflow: collate the inputs into a collated data workbook (Excel workbook with multiple worksheets), filter, then run analyses

Expression data
• Input data as tab-delimited text files (or single-sheet Excel files) in one of the following two formats:
  1. Horizontally aligned
  2. Separate files
• Files may contain expression data in the form of signal (or single-channel expression summary), dual-channel intensities, or expression ratios (for dual-channel data).
• For Affymetrix data, raw CEL files can be imported directly.

Expression data: horizontally aligned data example
[Figure: example spreadsheet showing Array data block #1, Array data block #2, and Array data block #3]

Gene identifiers
• A gene identifiers file is used for annotation purposes.
• Gene identifiers include, e.g., clone ID, UniGene cluster ID, gene symbol, GenBank accession, and probeset ID.
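To make the relationship between an expression file and a gene identifiers file concrete, here is a small Python sketch that reads two tab-delimited tables and annotates each expression row with a gene symbol. The column names (`ProbeSet`, `GeneSymbol`, `Array1`, ...), the probe IDs, and the values are all hypothetical; BRB-ArrayTools performs this matching itself during collation, so this only illustrates the idea and is not its actual implementation.

```python
import csv
import io

# Hypothetical tab-delimited expression data: one row per probeset,
# one column per array.
expression_text = (
    "ProbeSet\tArray1\tArray2\tArray3\n"
    "probe_001\t8.21\t8.05\t7.90\n"
    "probe_002\t5.10\t5.42\t6.01\n"
    "probe_003\t3.33\t3.10\t3.75\n"
)

# Hypothetical gene identifiers file keyed by the same probeset ID.
identifier_text = (
    "ProbeSet\tGeneSymbol\n"
    "probe_001\tGENE_A\n"
    "probe_002\tGENE_B\n"
)

def read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

# Map each probeset ID to its gene symbol.
symbol_by_probe = {
    row["ProbeSet"]: row["GeneSymbol"] for row in read_tsv(identifier_text)
}

# Attach the annotation to every expression row; unannotated probes get "NA".
annotated = []
for row in read_tsv(expression_text):
    probe = row["ProbeSet"]
    values = [float(v) for key, v in row.items() if key != "ProbeSet"]
    annotated.append((probe, symbol_by_probe.get(probe, "NA"), values))

for probe, symbol, values in annotated:
    print(probe, symbol, values)
```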
Gene identifiers
[Figure: an example of a gene identifier file]

Experiment (Array) descriptors
• An experiment descriptors file describes the samples and the group information used for each array.
• After the header row, each row in this file represents one array or sample, and each column represents one descriptor variable.
• A typical descriptors file includes at least three columns: array ID, sample/patient ID, and group information.

Experiment descriptors
[Figure: example descriptors file describing the samples used for each array]

Automatic data importers
• General format data: guides the user through the specification of the data components.
• Affymetrix CEL files: Data Import Wizard, Affymetrix Gene ST Array Importer
• NCBI GDS/GSE: automatically imports NCBI GEO array data given a GDS/GSE number.

Data filtering options: Spot filters
• Single-channel
  • Intensity filter: filter out spots with low intensity.
  • Detection call: exclude a probeset if the detection call value is “A”, “M”, “P”, or “No Call”.
• Dual-channel
  • Background correction and averaging of replicate spots can be performed.
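The experiment descriptors layout described earlier (a header row, then one row per array and one column per descriptor variable) can be sketched in a few lines of Python. The column names `ArrayID`, `SampleID`, and `Group` and all sample values are assumptions made up for this example; a real descriptors file may use different headers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical descriptors file: header row, then one row per array.
descriptor_text = (
    "ArrayID\tSampleID\tGroup\n"
    "Array1\tPatient01\tControl\n"
    "Array2\tPatient02\tControl\n"
    "Array3\tPatient03\tTreated\n"
    "Array4\tPatient04\tTreated\n"
)

def arrays_by_group(text, group_column="Group", array_column="ArrayID"):
    """Return a dict mapping each group label to the list of its array IDs."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        groups[row[group_column]].append(row[array_column])
    return dict(groups)

groups = arrays_by_group(descriptor_text)
for label, arrays in groups.items():
    print(label, arrays)
```

A grouping like this is what a class comparison needs as input: it tells the analysis which arrays belong to which class.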
Data filtering options: Normalization
• Transformation: a log2 transformation is applied to the signal intensities before normalization.
• Normalization:
  • Single-channel: quantile normalization, specific target percentile, relative to reference array, and relative to reference array groups
  • Dual-channel: median normalization, housekeeping gene normalization, lowess normalization, and print-tip group normalization
• Truncation: truncate extreme values.

Data filtering options: Gene filters
• Fold-change filter: specify a minimum percentage of log-expression values that must meet a specified fold-change criterion.
• Log-intensity variation filter: screen out genes that do not vary much over the set of samples:
  1. Percentile criterion screens out a specified percentage of genes with the smallest variance.
  2. Significance criterion compares the variance of each gene against that of the “average” gene.

Data filtering options: Gene filters (gene quality)
• Percent missing: screens out genes/probesets that contain too many missing values over the set of samples.
• Minimum intensity: only available for single-channel data. It filters out genes whose Nth percentile normalized log intensity is less than the log of the user-defined value.

Data filtering options: Gene subsets
• Select genelists for analysis: subset the data by selecting one or more genelists to INCLUDE or EXCLUDE.
• Specify gene labels to exclude: exclude genes based on gene identifier labels.
• Probe reduction: reduce multiple probe sets per gene by choosing the most variably expressed or the maximally expressed probe/probeset.

Class Comparison (statistical analysis)
• Class comparison between groups of arrays
• SAM
• Gene set expression comparison
• Between red and green channels
• ANOVA models

Other tools
• Graphics: different plots.
• Clustering: heatmap, hierarchical, and NMF (non-negative matrix factorization) clustering.
• Survival analysis: for cancer and other diseases.
• Methylation tools: find frequently methylated probes and correlate methylation with gene expression.
• CGHTools: analyze copy number variation.
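Of the single-channel normalization options listed under Data filtering options, quantile normalization is simple enough to sketch directly. The pure-Python version below is a textbook illustration under stated assumptions: the input is a complete genes x arrays matrix of log2 intensities with no missing values, ties are broken by row order rather than averaged, and the data are invented. It is not the BRB-ArrayTools implementation, whose internals the slides do not describe.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x arrays matrix given as a list of rows.

    Every array (column) ends up with the same distribution of values:
    the across-array mean at each rank. Ties are broken by row order.
    """
    n_genes = len(matrix)
    n_arrays = len(matrix[0])

    # Column-wise view: one list of values per array.
    columns = [[row[j] for row in matrix] for j in range(n_arrays)]

    # Sort each array, then average across arrays at each rank.
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [
        sum(col[i] for col in sorted_columns) / n_arrays for i in range(n_genes)
    ]

    # Replace each value by the mean belonging to its rank within its array.
    normalized = [[0.0] * n_arrays for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene_index in enumerate(order):
            normalized[gene_index][j] = rank_means[rank]
    return normalized

# Hypothetical log2 intensities: 4 genes (rows) x 3 arrays (columns).
log2_data = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.5, 6.0],
    [4.0, 2.0, 8.0],
]

normalized = quantile_normalize(log2_data)
for row in normalized:
    print([round(value, 2) for value in row])
```

After normalization, sorting any column gives the same list of values, which is the defining property of the method; the rank of each gene within its own array is unchanged.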