Statistical Methods for the Screening and Classification of Microarray Gene Expression Data
Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
http://www.maths.uq.edu.au/~gjm
Institute for Molecular Bioscience,
University of Queensland
Liat Jones
Richard Bean
Justin Zhu
Outline of Workshop
Part 1: Introduction to Microarray Technology
Part 2: Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
Part 3: Supervised Classification of Tissue Samples
Part 4: Unsupervised Classification: Cluster Analysis of Tissue Samples and Gene Profiles
Part 5: Linking Microarray Data with Survival Analysis
A microarray is a new technology which allows
the measurement of the expression levels of
thousands of genes simultaneously.
(1) Sequencing of the genome (human, mouse, and others)
(2) Improvement in technology to generate high-density
arrays on chips (glass slides or nylon membrane)
The entire genome of an organism can be
probed at a single point in time.
Draft of the Human Genome
Public Sequence
Nature, Feb. 2001
Celera Sequence
Science, Feb. 2001
The Challenge for Statistical Analysis of
Microarray Data
Microarrays present new problems for statistics
because the data are very high dimensional with
very little replication.
The challenge is to extract useful information and
discover knowledge from the data, such as gene
functions, gene interactions, regulatory pathways,
metabolic pathways etc.
Vital Statistics
by C. Tilstone
Nature 424, 610-612, 2003.
“DNA microarrays have
given geneticists and
molecular biologists
access to more data than
ever before. But do these
researchers have the
statistical know-how to
cope?”
Branching out: cluster
analysis can group
samples that show
similar patterns of gene
expression.
Representation of Data from M Microarray Experiments
[Figure: N x M matrix of gene expressions. Rows are Gene 1, Gene 2, ..., Gene N; columns are Sample 1, Sample 2, ..., Sample M. A row is the expression profile of a gene; a column is the expression signature of a sample.]
Assume we have extracted gene expression values from intensities.
It is assumed that the (logged) expression levels have been preprocessed with adjustment for array effects.
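The matrix layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, sample names, and expression values are hypothetical toy data, and the helper functions `expression_profile` and `expression_signature` are names introduced here, not taken from the slides.

```python
# Hypothetical toy data: N = 3 genes (rows) by M = 4 samples (columns).
# Values stand in for preprocessed (logged) expression levels.
genes = ["gene_1", "gene_2", "gene_3"]
samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
expr = [
    [0.5, -1.2, 0.3, 2.0],
    [1.1, 0.0, -0.4, 0.7],
    [-0.9, 0.8, 1.5, -0.2],
]


def expression_profile(matrix, gene_index):
    """Return one row: a single gene measured across all samples."""
    return list(matrix[gene_index])


def expression_signature(matrix, sample_index):
    """Return one column: all genes measured in a single sample."""
    return [row[sample_index] for row in matrix]


profile_gene_2 = expression_profile(expr, 1)
signature_sample_1 = expression_signature(expr, 0)
print(genes[1], profile_gene_2)
print(samples[0], signature_sample_1)
```

A profile has length M (one value per sample) and a signature has length N (one value per gene), which is the distinction that clustering genes versus clustering tissue samples relies on.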
• Majority of time on a data analysis project will be spent
“cleaning” the data before doing any analysis
• Paradoxically, most statistical training assumes that the data
arrive “precleaned.” Students, whether in PhD programs
or an undergraduate introductory course, are not taught
routinely to check data for accuracy or even to worry about it.
Exacerbating the problem further are claims by software vendors
that their techniques can produce valid results no matter what the
quality of the incoming data.
De Veaux and Hand (How to Lie with Bad Data, Statist. Sci., 2005)
“Large-scale gene expression studies are not
a passing fashion, but are instead one aspect
of new work of biological experimentation,
one involving large-scale, high throughput
assays.”
Speed et al., 2002, Statistical Analysis of Gene
Expression Microarray Data, Chapman and
Hall/ CRC
Growth of microarray and microarray methodology literature listed in PubMed from 1995 to
2003.
The category ‘all microarray papers’ includes those found by searching PubMed for
microarray* OR ‘gene expression profiling’. The category ‘statistical microarray papers’
includes those found by searching PubMed for ‘statistical method*’ OR ‘statistical
techniq*’ OR ‘statistical approach*’ AND microarray* OR ‘gene expression profiling’.
Mehta et al (Nature Genetics, Sept. 2004):
“The field of expression data analysis is particularly
active with novel analysis strategies and tools being
published weekly”, and the value of many of these
methods is questionable. Some results produced by using
these methods are so anomalous that a breed of ‘forensic’
statisticians (Ambroise and McLachlan, 2002; Baggerly et
al., 2003) who doggedly detect and correct other HDB
(high-dimensional biology) investigators’ prominent
mistakes, has been created.
Analyzing Microarray Gene Expression Data (UQ, Wiley)
Analysis of Microarray Gene Expression Data (Harvard, Kluwer)
The Analysis of Gene Expression Data (Johns Hopkins, Springer)
The Statistical Analysis of Gene Expression Data (Berkeley, C&H)
Statistics for Microarrays
Design and Analysis of DNA Microarrays
Exploration and Analysis of Microarrays
Data Analysis Tools for DNA Microarrays
In the sequel, references to most of the material presented
can be found in my joint book,
McLachlan, Do, and Ambroise (2004), Analyzing Microarray
Gene Expression Data, Hoboken, NJ: Wiley.
Contents
1. Microarrays in Gene Expression Studies
2. Cleaning and Normalization
3. Some Cluster Analysis Methods
4. Clustering of Tissue Samples
5. Screening and Clustering of Genes
6. Discriminant Analysis
7. Supervised Classification of Tissue Samples
8. Linking Microarray Data with Survival Analysis
Distribution of References by Year
Year     #
2004     34
2003     73
2002     80
2001     93
2000     47 (67.8%)
Total    481
mRNA Levels Indirectly Measure Gene Activity
• Essentially every cell contains the same genes.
• Cells differ in the genes which are active at any one time.
• Gene expression is transcription of DNA to mRNA.
• mRNA is translated to proteins.
• Type and amount of mRNA produced by a cell tells which genes are being expressed.
Technical Background
Two recent advances:
• Human Genome Project (also other
sequenced genomes: mouse, dog etc)
• DNA microarray technology -- works by
exploiting the ability of a given mRNA
molecule to bind specifically to (hybridize)
the DNA template from which it originated
What is a DNA microarray?
• Small, solid supports onto which the sequences
from thousands (tens of thousands) of genes are
attached at fixed locations.
• They may be glass slides, or silicon chips or
nylon membranes.
• The DNA is printed, spotted or synthesized
directly onto the support
• The spots can be DNA, cDNA or
oligonucleotides.
The microarray experiment
[Figure labels: spot DNA (known); sample (unknown)]
Microarrays Indirectly Measure Levels of mRNA
• mRNA is extracted from the cell
• mRNA is reverse transcribed to cDNA (mRNA itself is unstable)
• cDNA is labeled with fluorescent dye (the TARGET)
• The sample is hybridized to known DNA sequences on the array, covering tens of thousands of genes (the PROBE)
• If present, complementary target binds to probe DNA (complementary base pairing)
• Target bound to probe DNA fluoresces
The microarray experiment
• mRNA from the cell (sample) is washed
over the surface – HYBRIDIZATION
• measure the amount of bound mRNA at
each spot
Allows the measurement of expression for
thousands of genes from the amount of bound
mRNA.
A Spotted cDNA Microarray Experiment
• Compare the gene expression levels for
two cell populations on a single microarray.
• e.g. tumour and normal cells
Microarray Image
Red: High expression in target labelled with cyanine 5 dye
Green : High expression in target labelled with cyanine 3 dye
Yellow : Similar expression in both target samples
Assumptions:
(1) Gene Expression -> mRNA: cellular mRNA levels directly reflect gene expression.
(2) mRNA -> Fluorescence Intensity: intensity of bound target is a measure of the abundance of the mRNA in the sample.
Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
Why are microarrays important?
• They contain a very large number of genes
and are very small.
• Compare gene expression within a single
sample or in two different cell types or tissue
samples
• Examine expressions in a single sample on
a genome-wide scale (GENOMICS)
• Infer new gene functions, diagnostic tools –
e.g. in cancer provides a molecular view.
The Microarray Technologies
Spotted Microarray
• cDNAs, clones, or short and long oligonucleotides deposited onto glass slides
• Each gene (or EST) represented by its purified PCR product
• Simultaneous analysis of two samples (treated vs untreated cells) provides internal control.
• relative gene expressions
Affymetrix GeneChip
• short oligonucleotides synthesized in situ onto glass wafers
• Each gene represented multiply - using 16-20 (preferably non-overlapping) 25-mers.
• Each oligonucleotide has single-base mismatch partner for internal control of hybridization specificity.
• absolute gene expressions
Each with its own advantages and disadvantages
Pros and Cons of the Technologies
Spotted Microarray
• Flexible and cheaper
• Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions)
• Variability in spot quality from slide to slide
• Provide information only on relative gene expressions between cells or tissue samples
Affymetrix GeneChip
• More expensive yet less flexible
• Good for whole genome expression analysis where genome of that organism has been sequenced
• High quality with little variability between slides
• Gives a measure of absolute expression of genes
Aims of a Microarray Experiment
• observe changes in a gene in response to external stimuli
(cell samples exposed to hormones, drugs, toxins)
• compare gene expressions between different tissue types
(tumour vs normal cell samples)
To gain understanding of
• function of unknown genes
• disease process at the molecular level
Ultimately to use as tools in Clinical Medicine for diagnosis,
prognosis and therapeutic management.
Importance of Experimental Design
• Good DNA microarray experiments should
have clear objectives.
• Not performed as “aimless data mining in
search of unanticipated patterns that will
provide answers to unasked questions”
(Richard Simon, BioTechniques 34:S16-S21, 2003)
Replicates
Technical replicates: arrays that have been hybridized to the same
biological source (using the same treatment, protocols, etc.)
Biological replicates: arrays that have been hybridized to different
biological sources, but with the same preparation, treatments, etc.
Extracting Data from the Microarray
•Cleaning
Image processing
Filtering
Missing value estimation
•Normalization
Remove sources of systematic variation.
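As a rough illustration of what removing a systematic array effect can look like, the sketch below median-centers the log ratios of each array. This is one simple normalization chosen here for illustration, not necessarily the method used in the workshop; the function name `median_center` and the toy values are hypothetical.

```python
from statistics import median


def median_center(log_ratios_by_array):
    """Subtract each array's median log ratio from every value on that array."""
    centered = []
    for values in log_ratios_by_array:
        m = median(values)
        centered.append([v - m for v in values])
    return centered


# Hypothetical log ratios for the same five genes on two arrays.
arrays = [
    [0.9, 1.1, 1.0, 1.4, 0.6],    # array 1: shifted upward overall
    [-0.1, 0.1, 0.0, 0.4, -0.4],  # array 2: centered near zero
]
normalized = median_center(arrays)
for values in normalized:
    print([round(v, 2) for v in values])
```

After centering, both arrays have median zero, so a constant offset between arrays (for example from a dye or scanner effect) no longer masquerades as differential expression.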
Examples of spot imperfections. A. donut shape; B. oval or pear
shape; C. holey heterogeneous interior; D. high-intensity artifact;
E. sickle shape; F. scratches.
Gene Expressions from Measured Intensities
Spotted Microarray:
log2(Intensity Cy5 / Intensity Cy3)
Affymetrix:
(Perfect Match Intensity – Mismatch Intensity)
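The two quantities above can be computed as in the following sketch. The function names and the intensity values are hypothetical, and averaging the PM minus MM differences over a gene's probe pairs is an assumption made here; the slide itself only gives the difference.

```python
import math


def log_ratio(cy5, cy3):
    """log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return math.log2(cy5 / cy3)


def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene (assumed summary)."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for illustration only.
print(log_ratio(2000.0, 500.0))  # a four-fold ratio gives 2.0
print(average_difference([120.0, 150.0, 90.0], [100.0, 110.0, 95.0]))
```

On the log2 scale a value of 0 means equal intensity in both channels, and +1 or -1 means a two-fold difference in either direction, which is why the spotted-array measure is a relative one.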
Data Transformation
$$\log\left(x + \sqrt{c^{2} + x^{2}}\right)$$
Rocke and Durbin (2001), Munson (2001), Durbin et al. (2002),
and Huber et al. (2002)
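A minimal sketch of this transformation, assuming it is the generalized logarithm log(x + sqrt(c^2 + x^2)) discussed in the papers cited above. The function name `glog` and the value of the constant c are hypothetical; in practice c would be estimated from the data.

```python
import math


def glog(x, c):
    """Generalized log: log(x + sqrt(c^2 + x^2)), defined for any real x when c > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.log(x + math.sqrt(c * c + x * x))


c = 10.0  # hypothetical constant, chosen only for illustration
for x in (-5.0, 0.0, 10.0, 1000.0, 1e6):
    print(x, glog(x, c))
```

Unlike the ordinary logarithm, this stays finite at zero and for small negative background-corrected intensities, while for x much larger than c it behaves like log(2x), that is, an ordinary log up to an additive constant.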
Representation of Data from M Microarray Experiments
[Figure: N x M matrix of gene expressions. Rows are Gene 1, Gene 2, ..., Gene N; columns are Sample 1, Sample 2, ..., Sample M. A row is the expression profile of a gene; a column is the expression signature of a sample.]
Assume we have extracted gene expression values from intensities.
Gene expressions can be shown as heat maps.
It is assumed that the (logged) expression levels have been preprocessed with adjustment for array effects.