Download Week 7-Microarrays

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Gene wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Genome (book) wikipedia , lookup

Metagenomics wikipedia , lookup

Genome evolution wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Gene therapy wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Gene desert wikipedia , lookup

Helitron (biology) wikipedia , lookup

Gene nomenclature wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Microevolution wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Designer baby wikipedia , lookup

Gene expression programming wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Gene expression profiling wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
10/09/2015
Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week7: Microarray data analysis
Peng Xiao, Ph.D.
Assistant Professor
(Guda lab)
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Microarray
•  A high throughput technology allowing to investigate the
genome-wide gene expression simultaneously
•  A “snapshot” of the gene expression profile of a type of cells,
tissue, or organism
__________________________________________________________________________________________________
Fall 2015
GCBA 815
1
10/09/2015
Microarray technology
•  GeneChips
•  Contain short (25-mer) oligonucleotides as probes
•  About 400,000 different DNA spots per chip
•  Space for a lot of controls and multiple spots (16-20)
per gene
•  Mismatched oligos for background correction
•  Spotted Arrays
•  Concentrated DNA is prepared and spotted on glass
slides
•  Much cheaper than GeneChips
•  Designed to contain only the genes of interest
•  Usually limited to about 15,000 spots per slide
•  Limits controls and duplicates for each gene
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Two-channel vs one-channel detection
•  Two-channel (or two-color)
•  Typically used to compare expression
in two different treatments
•  Two types of cDNA labeling dyes
•  Cy3- has emission at 570nm
(corresponds to green)
•  Cy5- has emission at 670nm
(corresponds to red)
•  The two Cy-labeled cDNA samples
are mixed and hybridized to the a
microarray
•  Relative intensities are are used
detect up or down-regulated genes
__________________________________________________________________________________________________
Fall 2015
GCBA 815
2
10/09/2015
Two-channel vs one-channel detection
•  Single-channel (one-color)
•  A single-color dye, no dye bias from two colors
•  One array for one sample
•  One bad quality sample does not affect the other array
results, a problem in two-channel arrays
•  Results are easily comparable to arrays from different
experiments
•  Requires twice as many microarrays compared to twochannel arrays
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Experimental design
•  Controls (RNA spike-ins)
•  Negative control for background calculations
•  Normalization of Cy3 and Cy5 signals
•  Technical replicates
•  Multiple aliquots of the same sample are run separately to
account for technical variation
• Biological replicates
•  Multiple independent samples of the same kind are run
separately to account for sample to sample variation
•  Multiple spots for each gene on a single array to measure
statistical significance.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
3
10/09/2015
Affymetrix File Types
•  DAT file
•  Raw (TIFF) optical image of the hybridized chip
•  EXP file
•  Text file with experimental details
•  CEL file
•  Processed DAT file (with intensity/position values)
•  CDF file
•  Described layout of chip (provided by Affymetrix)
•  CHP file
•  Results created from CEL and CDF file
•  TXT file
•  CHP file in text format
•  RPT file
•  Report file with QC information
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Basic statistics
•  Raw intensities
•  Pre-processing, background correction
•  Transformed into expression values (called normalization)
•  Use algorithms such as MAS5, RMA/GCRMA, etc.
•  Fold change
•  Ratio of normalized intensities of experimental/control
•  Log(intensity) or log(ratio)
•  Brings the differential expression values from multiplicative
scale to additive scale
•  Statistical test ( student’s t-test, F-test/one-way ANOVA)
•  P-value (obtained from a statistical test)
__________________________________________________________________________________________________
Fall 2015
GCBA 815
4
10/09/2015
Log transformations
Normalized
Intensity
Natural Log
(loge/ln)
Log2
Log 10
1
0
0
0
10
2.30
3.32
1
20
3
4.32
1.3
100
4.61
6.62
2
1000
6.91
9.97
3
10000
9.21
13.29
4
50000
10.82
15.61
4.7
100000
11.51
16.61
5
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Public Repositories of Microarray Datasets
•  NCBI’s GEO (Gene Expression Omnibus)
•  Original expression datasets
•  Curated gene expression datasets
•  Cluster tools and differential expression queries
•  EBI’s Array Express
__________________________________________________________________________________________________
Fall 2015
GCBA 815
5
10/09/2015
Microarray Data Analysis
Using BRB ArrayTools
§ 
BRB ArrayTools
§ 
§ 
R software
§ 
§ 
http://linus.nci.nih.gov/download_BRBArrayTools.html
https://www.r-project.org/
Bioconductor packages in R
§ 
https://www.bioconductor.org/install/
__________________________________________________________________________________________________
Fall 2015
GCBA 815
BRB-ArrayTools
An Integrated Software Tool for
DNA Microarray Analysis
n 
Developed under the direction of Dr. Richard Simon of
the Biometrics Research Branch, NCI.
n 
Software was developed with the purpose of deploying
powerful statistical tools for use by biologists.
n 
Analyses are launched from user-friendly Excel
interface. Also requires installation of a free software
called R for running back-end programs.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
6
10/09/2015
Data input to BRB-ArrayTools
Expression data
(one or more files)
Excel workbook containing
a single worksheet
(or simply a text file)
Gene identifiers
(may be in a
separate file)
Excel workbook containing
a single worksheet
(or simply a text file)
Experiment
descriptors
Collate
Collated data
Workbook
Excel workbook with
multiple worksheets
Filter
Excel workbook containing
a single worksheet
(or simply a text file)
User defined
gene lists
Run analyses
One or more text
files
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Expression data
n 
Input data as tab-delimited text files (or single-sheet Excel files) in
one of the following two formats:
1. Horizontally aligned
2. Separate files
n 
Files may contain expression data in the form of signal (or singlechannel expression summary), dual-channel intensities, or
expression ratios (for dual-channel data).
n 
For Affymetrix data, raw expression CEL files could be imported
directly.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
7
10/09/2015
Expression data
Horizontally aligned data example
Array data
block #1
Array data
block #2
Array data
block #3
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Gene identifiers
n 
A gene identifiers file is used for annotation
purposes.
n 
Gene identifiers: e.g. clone id, UniGene cluster id,
gene symbol, GenBank accession, and probeset
id.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
8
10/09/2015
Gene identifiers
An example of a gene identifier file
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Experiment (Array) descriptors
n 
An experiment descriptors file describes the samples and the
group information used for each array.
n 
After the header row, each row in this file represents one array
or sample, and each column represents one descriptor variable.
n 
A typical descriptors file at least includes the three columns:
array id, sample/patient id, and group information
__________________________________________________________________________________________________
Fall 2015
GCBA 815
9
10/09/2015
Experiment descriptors
Describes the samples used for each array
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Automatic data importers
n 
General format data: Guide through the specification of the data
components.
n 
Affymetrix CEL files - Data Import Wizard, Affymetrix Gene ST Array
Importer
n 
NCBI GDS/GSE: Automatic import NCBI GEO array data by providing
GDS/GSE number/
__________________________________________________________________________________________________
Fall 2015
GCBA 815
10
10/09/2015
Data filtering options
Spot filters
Single-Channel
Intensity filter: Filter out spots with low intensity.
n 
Detection Call: Exclude a probeset if the Detection call
value is “A”,”M” ,“P” or “No Call”.
Dual channel
Background correction and averaging replicate spots can
be performed.
n 
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Data filtering options
Normalization
n 
Transformation: A log2 transformation applied to the
signal intensities before normalization
n 
Normalization:
n 
¨ 
Single channel: quantile normalization, specific target
percentile, relative to reference array, and relative to
reference array groups
¨ 
Duel channel: median normalization, housekeeping gene
normalization, lowess normalization, and print-tip group
normalization
Truncation: Truncate extreme values
__________________________________________________________________________________________________
Fall 2015
GCBA 815
11
10/09/2015
Data filtering options
Gene filters
n 
Fold-change filter: Specify a minimum percentage
of log-expression values which must meet a
specified fold-change criteria
n 
Log-intensity variation filter:
Screen genes which do not vary much over the set
of samples:
1. Percentile criterion screens a specified
percentage of genes with smallest variance
2. Significance criterion compares the variance of
each gene against the “average” gene
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Data filtering options
Gene filters: Gene quality
n 
Percent missing: Screens out genes/probesets
that contain too many missing values over the set
of samples.
n 
Minimum Intensity: only available for single channel
data. It filters out genes whose Nth percentile
normalized log intensity is less than the log of the
user defined value.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
12
10/09/2015
Data filtering options
Gene subsets
n 
Select genelists for analysis: subset the data by
selecting one or more genelists to INCLUDE or
EXCLUDE.
n 
Specify gene labels to exclude: exclude genes
based on gene identifier labels.
n 
Probe reduction: Reduce multiple probe sets per
gene by choosing the most variably expressed or the
maximally expressed probe/probeset.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
Class Comparison
(statistical analysis)
n  Class
comparison between groups of
arrays
n  SAM
n  Gene set expression comparison
n  Between red and green channels
n  ANOVA models
__________________________________________________________________________________________________
Fall 2015
GCBA 815
13
10/09/2015
Other tools
n 
Graphics: different plots.
n 
Clustering: heatmap, hierarchical and NMF (nonnegative matrix factorization) clusterings.
n 
Survival analysis: for cancer and other diseases.
n 
Methylation tools: find frequently methylated probes and
correlate methylation with gene expression.
n 
CGHTools: analyze copy number variation.
__________________________________________________________________________________________________
Fall 2015
GCBA 815
14