Some Statistical Issues in Microarray Data Analysis
Alex Sánchez
Estadística i Bioinformàtica
Departament d’Estadística, Universitat de Barcelona
Unitat d’Estadística i Bioinformàtica, IR-HUVH
Outline
• Introduction
• Experimental design
• Selecting differentially expressed genes
  – Statistical tests
  – Significance testing
  – Linear models and analysis of the variance
  – Multiple testing
• Software for microarray data analysis
Introduction
Microarray experiments: Overview
Why are we talking about statistics?
A microarray experiment is, as its name says, an experiment. That is:
• It is performed to determine whether some prior hypotheses are true or false (although it can also lead to new hypotheses)
• It is subject to errors, which may arise from many sources
Sources of variability
• Biological heterogeneity in population
• Specimen collection/handling effects
  – Tumor: surgical bx, FNA
  – Cell line: culture condition, confluence level
• Biological heterogeneity in specimen
• RNA extraction
• RNA amplification
• Fluor labeling
• Hybridization
• Scanning
  – PMT voltage
  – laser power
(Geschwind, Nature Reviews Neuroscience, 2001)
Categories of variability
Systematic variability
• Amount of RNA in the biopsy
• Efficiencies of lab procedures such as RNA extraction, reverse transcription, labeling or photodetection
Random variation
• PCR yield
• DNA quality
• Spotting efficiency
• Spot size
• Cross-/unspecific hybridization
• Stray signal
Dealing with systematic variability
• Systematic variability has similar effects on many measurements
• Corrections can be estimated from the data
• CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability
Dealing with random variation
• Random variation cannot be explicitly accounted for
• The usual way to deal with it is to assume some ERROR MODEL (e.g. $e_i \sim N(0, \sigma^2)$)
• Assuming these error models are true:
  – EXPERIMENTAL DESIGN is (must be) used to control the effect of random variation
  – STATISTICAL INFERENCE is (must be) used to draw conclusions in the presence of random variation
[Flowchart: Biological question → Experimental design → Microarray experiment → Image analysis → Normalization → Analysis (Estimation, Testing, Clustering, Discrimination) → Biological verification and interpretation. The labels “Quality measurement” (with “Pass” and “Failed” branches) and “Today” also appear in the figure.]
Experimental design
Why experimental design?
The objective of experimental design is to make the analysis of the data and the interpretation of the results
• as simple and as powerful as possible,
• given the purpose of the experiment
• and the constraints of the experimental material.
Scientific aims and design choice
• The primary focus of the experiment needs to be clearly stated, whether it is
  – to identify differentially expressed genes
  – to search for specific gene-expression patterns
  – to identify phenotypic subclasses
• The aim of the experiment guides the design choice
  – Sometimes only one choice is reasonable
  – Sometimes different options are available
Designing microarray experiments
• The appropriate design of a microarray experiment must consider
  – the design of the array
  – the allocation of mRNA samples to the slides
I: Layout of the array
• Which sequences to use
  – cDNAs: selection of cDNA from a library (Riken, NIA, etc.)
  – Affymetrix: PMs and MMs
  – Oligo probe selection (from Operon, Agilent, etc.)
  – Control probes: what percentage? Where should controls be put?
• How many sequences to use
• Should there be replicate spots within a slide?
II: Allocating samples to slides
• Types of samples
  – Replication: technical vs biological
  – Pooled vs individual samples
• Different design layouts / data analyses:
  – Scientific aim of the experiment
  – Efficiency, robustness, extensibility
• Physical limitations (cost):
  – Number of slides
  – Amount of material
Basic principles of experimental design
• Apply the following principles to best attain the objectives of experimental design
  – Replication
  – Local control or blocking
  – Randomization
1. Replication
• It is important
  – to reduce uncertainty (increase precision)
  – to obtain sufficient power for the tests
  – as a formal basis for inferential procedures
  $\operatorname{var}(\bar{X}) = \sigma_X^2 / n$
• Consider different types of replicates
  – Technical: duplicate spots, multiple hybridizations from the same sample
  – Biological
• Replicate most what is expected to vary most!
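As a rough illustration of why replication increases precision, the sketch below simulates the variance of a sample mean for different numbers of replicates and compares it with sigma^2/n. The function name `variance_of_mean`, the Gaussian error model and the number of simulated experiments are assumptions made for this example only.

```python
import random
import statistics


def variance_of_mean(n_replicates, sigma=1.0, n_experiments=4000, seed=1):
    """Empirical variance of the mean of n_replicates draws from N(0, sigma^2)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_replicates))
        for _ in range(n_experiments)
    ]
    return statistics.pvariance(means)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        # The simulated value should be close to the theoretical sigma^2 / n.
        print(n, round(variance_of_mean(n), 3), "theory:", 1.0 / n)
```

Doubling the number of replicates roughly halves the variance of the mean, which is the sense in which replication buys precision.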
Biological vs Technical Replicates
[Figure: variance components $\sigma_B^2$, $\sigma_A^2$ and $\sigma_e^2$]
© Nature Reviews & G. Churchill (2002)
Replication vs Pooling
• mRNA from different samples is often combined to form a “pooled sample” or pool. Why?
  – If each sample doesn’t yield enough mRNA
  – To compensate for an excess of variability?
• Statisticians tend not to like it, but pooling may be OK if properly done
  – Combine several samples in each pool
  – Use several pools from different samples
  – Do not use pools when individual information is important (e.g. paired designs)
2. Blocking
• Assume we wish to perform an experiment to compare two treatments.
• The samples or their processing may not be homogeneous: there are blocks
  – Subjects: male/female
  – Arrays produced in two lots (February, March)
• If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded
  – Are the observed differences attributable to the treatment effect or to confounding factors?
Confounding block with treatment effects

Awful design
Sample  Treatment  Sex     Batch
1       A          Male    1
2       A          Male    1
3       A          Male    1
4       A          Male    1
5       B          Female  2
6       B          Female  2
7       B          Female  2
8       B          Female  2

Balanced design
Sample  Treatment  Sex     Batch
1       A          Male    1
2       A          Female  2
3       A          Male    1
4       A          Female  2
5       B          Male    1
6       B          Female  2
7       B          Male    1
8       B          Female  2

Two alternative designs to investigate treatment effects
• Awful design: treatment effects are confounded with sex and batch effects
• Balanced design: treatments are balanced between blocks
  – The influence of blocks is automatically compensated
  – Statistical analysis may separate block effects from the treatment effect
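The difference between the two designs can be checked mechanically. The following sketch encodes both tables and tests whether every treatment occurs equally often in every level of a blocking factor; the helper name `is_balanced` and the tuple layout (treatment, sex, batch) are choices made for this illustration.

```python
from collections import Counter

# Each row is (treatment, sex, batch), copied from the two design tables.
awful = [("A", "Male", 1)] * 4 + [("B", "Female", 2)] * 4
balanced = [("A", "Male", 1), ("A", "Female", 2)] * 2 + \
           [("B", "Male", 1), ("B", "Female", 2)] * 2


def is_balanced(design, block_index):
    """True if every treatment appears equally often in every block level."""
    counts = Counter((row[0], row[block_index]) for row in design)
    treatments = {row[0] for row in design}
    blocks = {row[block_index] for row in design}
    # A Counter returns 0 for treatment/block combinations that never occur.
    cell_counts = {counts[(t, b)] for t in treatments for b in blocks}
    return len(cell_counts) == 1 and 0 not in cell_counts


if __name__ == "__main__":
    for name, design in (("awful", awful), ("balanced", balanced)):
        print(name, "sex:", is_balanced(design, 1), "batch:", is_balanced(design, 2))
```

In the awful design some treatment/block combinations never occur, so treatment cannot be separated from sex or batch. In the balanced design every combination occurs twice.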
3. Randomisation
• Randomly assigning samples to groups to eliminate unspecific disturbances
  – Randomly assign individuals to treatments.
  – Randomise the order in which experiments are performed.
• Randomisation is required to ensure the validity of statistical procedures.
• Block what you can and randomize what you cannot
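One possible way to combine blocking and randomisation is to shuffle the samples inside each block and then alternate treatments. This is only a sketch: the function name `randomize_within_blocks`, the sample identifiers and the two treatment labels are hypothetical.

```python
import random


def randomize_within_blocks(samples_by_block, treatments=("A", "B"), seed=None):
    """Randomly assign treatments to samples, separately inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, samples in samples_by_block.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random order within the block
        for position, sample in enumerate(shuffled):
            # Alternating over the shuffled order keeps treatments balanced per block.
            assignment[sample] = treatments[position % len(treatments)]
    return assignment


# Hypothetical sample identifiers, blocked by sex.
blocks = {"Male": ["m1", "m2", "m3", "m4"], "Female": ["f1", "f2", "f3", "f4"]}
plan = randomize_within_blocks(blocks, seed=42)

if __name__ == "__main__":
    for sample, treatment in sorted(plan.items()):
        print(sample, treatment)
```

Which individual receives which treatment is random, but each block always contributes the same number of samples to each treatment.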
Experimental layout
• How are mRNA samples assigned to arrays?
• The experimental layout has to be chosen so that the resulting analysis is as efficient and robust as possible
  – Sometimes there is only one reasonable choice
  – Sometimes several choices are available
Example 1: Only one design choice
Case 1: Meaningful biological control (C)
Samples: Liver tissue from 4 mice treated with cholesterol-modifying drugs.
Question 1: Which genes respond differently between T and C?
Question 2: Which genes responded similarly across two or more treatments relative to control?
Case 2: Use of a universal reference.
Samples: Different tumor samples.
Question: To discover tumor subtypes.
[Diagrams: Case 1, treatments T1 to T4 each hybridized against control C; Case 2, tumor samples T1, T2, …, Tn-1, Tn each hybridized against a common reference Ref]
Example 2: A number of different designs are suitable for use (1)
• Time course experiments
  – Design choice depends on the comparisons of interest
[Diagrams: four alternative layouts for time points T1 to T4, one of them using a common reference Ref]
How can we decide?
• A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest
• A simple example: direct vs indirect estimates

Design    Layout          Estimate of log(A/B)     Variance
Direct    A vs B          average(log(A/B))        $\sigma^2/2$
Indirect  A vs R, B vs R  log(A/R) – log(B/R)      $2\sigma^2$
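The two variances can be derived in one line each. This derivation assumes (it is not stated on the slide) that each design uses two slides, that the log-ratio measured on any one slide has variance $\sigma^2$, and that slides are independent.

```latex
\begin{align*}
\text{Direct (two slides, A vs B):}\quad
  & \operatorname{Var}\!\left[\tfrac{1}{2}\,(M_1 + M_2)\right]
    = \tfrac{1}{4}\,(\sigma^2 + \sigma^2) = \frac{\sigma^2}{2},
  \qquad M_j = \log(A/B) \text{ on slide } j \\
\text{Indirect (A vs R, B vs R):}\quad
  & \operatorname{Var}\!\left[\log(A/R) - \log(B/R)\right]
    = \sigma^2 + \sigma^2 = 2\sigma^2
\end{align*}
```

Under these assumptions the direct comparison is four times as precise as the indirect one for the same number of slides.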
Summary
• Selection of mRNA samples is important
  – Most important: biological replicates
  – Technical replicates are also useful, but different
  – If needed and possible, use pooling wisely
• Choice of experimental layout is guided by
  – the scientific question
  – experimental design principles
  – efficiency and robustness considerations
• The correspondence between experimental designs, linear models and ANOVA can be exploited to select a model and analyze the data
Experimental design, Linear Models and Analysis of the Variance
• In experimental design, the different sources of variability influencing the observed response may be identified.
• These sources can be related to the response using a linear model
• Analysis of the variance can be used to separately estimate and test the relative importance of each source of variability.
Statistical methods to detect differentially expressed genes
Class comparison: Identifying differentially expressed genes
• Identify genes differentially expressed between different conditions such as
  – Treatment, cell type, ... (qualitative covariates)
  – Dose, time, ... (quantitative covariates)
  – Survival, infection time, ...
• Estimate effects/differences between groups, usually using log-ratios, i.e. the difference on the log scale: log(X) – log(Y) [= log(X/Y)]
What is a “significant change”?
• It depends on the variability within groups, which may differ from gene to gene.
• To assess the statistical significance of differences, conduct a statistical test for each gene.
Different settings for statistical tests
• Indirect comparisons: two groups, two samples, unpaired
  – E.g. 10 individuals: 5 with diabetes, 5 healthy
  – One sample from each individual
  – Typically: two-sample t-test or similar
• Direct comparisons: two groups, two samples, paired
  – E.g. 6 individuals with brain stroke
  – Two samples from each: one from the healthy region (region 1) and one from the affected region (region 2)
  – Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions
Different ways to do the experiment
• An experiment may use cDNA arrays (“two-colour”) or Affy arrays (“one-colour”).
• Depending on the technology used, the allocation of conditions to slides changes.

Experiment                      cDNA (2-col)                                      Affy (1-col)
10 indiv.: Diab (5), Heal (5)   Reference design: (5) Diab/Ref, (5) Heal/Ref      Comparison design: (5) Diab vs (5) Heal
6 indiv.: Region 1, Region 2    6 slides, 1 individual per slide: (6) reg1/reg2   12 slides: (6) paired differences
“Natural” measures of discrepancy
For direct comparisons in two colour, or paired comparisons in one colour:
• Mean (log) ratio: $\bar{R} = \frac{1}{n_T} \sum_{i=1}^{n_T} R_i$ (R and M are used interchangeably)
• Classical t-test: $t = \bar{R} / SE$ (SE estimates the standard error of $\bar{R}$)
• Robust t-test: use robust estimates of location and scale
For indirect comparisons in two colour, or direct comparisons in one colour:
• Mean difference: $\frac{1}{n_T} \sum_{i=1}^{n_T} T_i - \frac{1}{n_C} \sum_{i=1}^{n_C} C_i = \bar{T} - \bar{C}$
• Classical t-test: $t = (\bar{T} - \bar{C}) / \left(s_p \sqrt{1/n_T + 1/n_C}\right)$
• Robust t-test: use robust estimates of location and scale
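A minimal sketch of the classical two-sample statistic for the indirect (unpaired) setting, using the pooled standard deviation s_p. The function name `pooled_two_sample_t` and the example values are invented for illustration.

```python
import math
import statistics


def pooled_two_sample_t(treated, control):
    """Classical two-sample t statistic with a pooled variance estimate."""
    n_t, n_c = len(treated), len(control)
    pooled_var = (
        (n_t - 1) * statistics.variance(treated)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_t + n_c - 2)
    diff = statistics.fmean(treated) - statistics.fmean(control)
    return diff / math.sqrt(pooled_var * (1.0 / n_t + 1.0 / n_c))


if __name__ == "__main__":
    # Hypothetical log-expression values for one gene.
    treated = [1.0, 2.0, 3.0]
    control = [2.0, 3.0, 4.0]
    print(round(pooled_two_sample_t(treated, control), 4))
```

For the paired (direct) setting the same idea reduces to a one-sample statistic on the per-individual log ratios.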
Some Issues



Can we trust average effect sizes (average difference of
means) alone?
Can we trust the t statistic alone?
Here is evidence that the answer is no.
Gene   M1     M2     M3     M4     M5     M6     Mean    SD     t
A      2.5    2.7    2.5    2.8    3.2    2      2.61    0.40   16.10
B      0.01   0.05   -0.05  0.01   0      0      0.003   0.03   0.25
C      2.5    2.7    2.5    1.8    20     1      5.08    7.34   1.69
D      0.5    0      0.2    0.1    -0.3   0.3    0.13    0.27   1.19
E      0.1    0.11   0.1    0.1    0.11   0.09   0.10    0.01   33.09
Courtesy of Y.H. Yang
•Averages can be driven by outliers.
•t’s can be driven by tiny variances.
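The summary columns of the five-gene table (genes A to E, measurements M1 to M6) can be recomputed with a one-sample t statistic on the six log ratios of each gene. This is a sketch: the helper name `one_sample_t` is made up for the example, and the results should agree with the Mean, SD and t columns only up to rounding.

```python
import math
import statistics

# Log ratios M1..M6 for each gene, copied from the table.
genes = {
    "A": [2.5, 2.7, 2.5, 2.8, 3.2, 2.0],
    "B": [0.01, 0.05, -0.05, 0.01, 0.0, 0.0],
    "C": [2.5, 2.7, 2.5, 1.8, 20.0, 1.0],
    "D": [0.5, 0.0, 0.2, 0.1, -0.3, 0.3],
    "E": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}


def one_sample_t(values):
    """Return (mean, sample SD, t) for H0: true mean log ratio = 0."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean, sd, mean / (sd / math.sqrt(len(values)))


if __name__ == "__main__":
    for name, values in genes.items():
        mean, sd, t = one_sample_t(values)
        print(f"{name}: mean={mean:.3f} sd={sd:.3f} t={t:.2f}")
```

Gene C has the largest mean only because of the single value 20, and gene E has the largest t only because its SD is tiny. These are the two failure modes the slide points out.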
Variations in t-tests (1)
• Let
  – $R_g$: mean observed log ratio for gene g
  – $SE_g$: standard error of $R_g$ estimated from data on gene g
  – $SE$: standard error of $R_g$ estimated from data across all genes
• Global t-test: $t = R_g / SE$
• Gene-specific t-test: $t = R_g / SE_g$
Some pros and cons of t-tests

Test                           Pros                               Cons
Global t-test: t = R_g/SE      Yields stable variance estimate    Assumes variance homogeneity; biased if false
Gene-specific: t = R_g/SE_g    Robust to variance heterogeneity   Low power; yields unstable variance estimates (due to few data)
T-test extensions
• SAM (Tibshirani, 2001): $S = \frac{R_g}{c + SE_g}$
• Regularized-t (Baldi, 2001): $t = \frac{R_g}{\sqrt{\left(v_0 SE^2 + (n-1) SE_g^2\right) / (v_0 + n - 2)}}$
• EB-moderated t (Smyth, 2003): $t = \frac{R_g}{\sqrt{\left(d_0 SE_0^2 + d \, SE_g^2\right) / (d_0 + d)}}$
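The common idea behind these extensions is to stop a tiny gene-specific standard error from inflating the statistic. The sketch below implements the SAM-style and moderated-t formulas as written on this slide; it is not the implementation of any particular package, and the function names, the fudge constant c and the prior values SE0 and d0 are hypothetical.

```python
import math


def sam_statistic(mean_log_ratio, se_gene, c):
    """SAM-style statistic: a constant c is added to the gene-specific SE."""
    return mean_log_ratio / (c + se_gene)


def moderated_t(mean_log_ratio, se_gene, d, se_prior, d0):
    """Moderated t: the gene-specific SE^2 is shrunk towards a prior value SE0^2."""
    shrunk_se2 = (d0 * se_prior ** 2 + d * se_gene ** 2) / (d0 + d)
    return mean_log_ratio / math.sqrt(shrunk_se2)


if __name__ == "__main__":
    # A gene with a small effect and a very small standard error (made-up values).
    mean_log_ratio, se_gene, d = 0.10, 0.003, 5
    print("ordinary t :", round(mean_log_ratio / se_gene, 2))
    print("SAM-style S:", round(sam_statistic(mean_log_ratio, se_gene, c=0.05), 2))
    print("moderated t:", round(moderated_t(mean_log_ratio, se_gene, d, se_prior=0.1, d0=4), 2))
```

With c = 0 or d0 = 0 both statistics reduce to the ordinary gene-specific t, so the extra constants control how much information is borrowed from the other genes.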
Up to here: can we generate a list of candidate genes?
With the tools we have, the reasonable steps to generate a list of candidate genes may be:
Gene 1: M11, M12, …, M1k
Gene 2: M21, M22, …, M2k
…
Gene G: MG1, MG2, …, MGk
→ For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, …
→ Statistics of interest: S1, S2, …, SG
→ ?
→ A list of candidate DE genes
• We need an idea of how significant these values are
• We’d like to assign them p-values
Significance testing
Nominal p-values
• After a test statistic is computed, it is convenient to convert it to a p-value: the probability that the test statistic, say S(X), takes a value equal to or greater than the one taken on the observed sample, say S(X0), under the assumption that the null hypothesis is true
p = P{S(X) ≥ S(X0) | H0 true}
Significance testing
• Test of significance at the α level: reject the null hypothesis if the p-value is smaller than the significance level
  – It has advantages but is not free from criticism
• Genes with p-values falling below a prescribed level may be regarded as significant
Hypothesis testing overview for a single gene

State of nature ("Truth")   H0 is rejected (gene is selected)        H0 is accepted (gene not selected)
H0 is false (affected)      TP, prob: 1-β                            FN, prob: β (Type II error)
H0 is true (not affected)   FP, P[Rej H0 | H0] ≤ α (Type I error)    TN, prob: 1-α

Positive predictive value: TP/[TP+FP]
Negative predictive value: TN/[TN+FN]
Sensitivity: TP/[TP+FN]
Specificity: TN/[TN+FP]
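The four summary measures are simple ratios of the cell counts. A small sketch, where the function name `summary_rates` and the counts are invented for illustration:

```python
def summary_rates(tp, fp, tn, fn):
    """Sensitivity, specificity and predictive values from the four cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical outcome for 1100 genes, 100 of which are truly affected.
    for name, value in summary_rates(tp=80, fp=50, tn=950, fn=20).items():
        print(f"{name}: {value:.3f}")
```

In this made-up example sensitivity and specificity are both high, yet more than a third of the selected genes are false positives, because unaffected genes greatly outnumber affected ones.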
Calculation of p-values
• Standard methods for calculating p-values:
  (i) Refer to a statistical distribution table (Normal, t, F, …), or
  (ii) Perform a permutation analysis
(i) Tabulated p-values
• Tabulated p-values can be obtained for standard test statistics (e.g. the t-test)
• They often rely on the assumption of normally distributed errors in the data
• This assumption can be checked (approximately) using a
  – Histogram
  – Q-Q plot
Example
Golub data, 27 ALL vs 11 AML samples, 3051 genes
A t-test yields 1045 genes with p< 0.05
(ii) Permutation tests
• Based on data shuffling (random interchange of labels between samples). No distributional assumptions
• Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
• Repeat for every possible permutation, b = 1…B:
  – Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”
  – For each gene, calculate the corresponding two-sample t-statistic, t_b
• After all B permutations are done, put p = #{b: |t_b| ≥ |t_observed|}/B
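For a single gene the procedure can be sketched as follows. To stay close to “every possible permutation”, the example enumerates all ways of choosing which n1 values are labelled as treatments, which is only feasible for small samples; with larger samples a random subset of relabellings is normally used instead. The function names and the data are hypothetical.

```python
import itertools
import math
import statistics


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    diff = statistics.fmean(x) - statistics.fmean(y)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(treated, control):
    """Two-sided p-value: share of relabellings with |t_b| >= |t_observed|."""
    pooled = list(treated) + list(control)
    n1 = len(treated)
    t_observed = abs(two_sample_t(treated, control))
    extreme = total = 0
    for chosen in itertools.combinations(range(len(pooled)), n1):
        chosen_set = set(chosen)
        x = [pooled[i] for i in chosen]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance so ties are not lost to floating-point rounding.
        if abs(two_sample_t(x, y)) >= t_observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    treated = [5.1, 5.3, 5.6, 5.8]   # hypothetical log-expression values
    control = [4.0, 4.2, 4.1, 4.4]
    print(permutation_p_value(treated, control))
```

With four samples per group there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70. This limited resolution matters once p-values have to be adjusted for many genes.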
Permutation tests (2)
Volcano plot: fold change vs log(odds)
[Figure legend: Significant change detected / No change detected]
Linear models and Analysis of the Variance to analyze designed experiments
From experimental design to linear models
• Some weaknesses of the statistical framework seen so far
  – What to do if the treatment has more than 2 levels?
  – How to deal with more than one treatment or experimental condition?
  – How to deal with nuisance factors such as batch effects, covariates, etc.?
• Most of this can be solved with an alternative approach: Analysis of the Variance
Multiple testing
How far can we trust the decision?
• The test “Reject H0 if p-value ≤ α”
  – is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold
• Nothing is guaranteed about P[FN]
  – “Optimal” tests are built trying to minimize this probability
  – In practical situations it is often high
What if we wish to test more than one gene at once? (1)
• Consider more than one test at once
  – Two independent tests, each at the 5% level: the probability of getting at least one false positive is now 1 – 0.95·0.95 = 0.0975
  – Three tests → $1 - 0.95^3 = 0.1426$
  – n tests → $1 - 0.95^n$
  – This converges towards 1 as n increases
• Small p-values don’t necessarily imply significance! We are no longer controlling the probability of a type I error
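The growth of this probability is easy to tabulate. The sketch assumes independent tests, which is what the product 0.95·0.95 implies; the function name is a placeholder.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 100, 1000):
        print(n, round(prob_at_least_one_false_positive(n), 4))
```

With a few hundred tests the probability is already indistinguishable from 1, and a microarray involves thousands of genes.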
What if we wish to test more than one gene at once? (2): a simulation
• Simulation of this process for 6,000 genes with 8 treatments and 8 controls
• All the gene expression values were simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in our simulation
• The number of genes falsely rejected will be, on average, 6000·α; i.e. if we wanted to reject all genes with a p-value of less than 1%, we would falsely reject around 60 genes
See example
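A sketch of such a simulation is given below. It is not the example referred to on the slide: to stay within the standard library it uses a z-test with known variance 1 instead of a t-test, and the function name and seed are arbitrary.

```python
import random
from statistics import NormalDist, fmean


def count_false_positives(n_genes=6000, n_per_group=8, alpha=0.01, seed=0):
    """Count genes declared significant when no gene is differentially expressed."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference of two means of n N(0,1) values.
    se = (2.0 / n_per_group) ** 0.5
    rejected = 0
    for _ in range(n_genes):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejected += 1
    return rejected


if __name__ == "__main__":
    print(count_false_positives())
```

The count varies from run to run around 6000·0.01 = 60, and every one of those genes is a false positive.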
Multiple testing: Counting errors

State of nature ("Truth")   H0 is rejected (genes selected)   H0 is accepted (genes not selected)   Total
H0 is false (affected)      m_α - α·m0 (S)                    (m - m0) - (m_α - α·m0) (T)           m - m0
H0 is true (not affected)   α·m0 (V)                          m0 - α·m0 (U)                         m0
Total                       m_α (R)                           m - m_α (m - R)                       m

V = # Type I errors [false positives]
T = # Type II errors [false negatives]
All these quantities could be known if m0 was known
How does type I error control extend to multiple testing situations?
• Selecting genes with a p-value less than α doesn’t control P[FP] anymore
• What can be done?
  – Extend the idea of type I error: FWER and FDR are two such extensions
  – Look for procedures that control the probability of these extended error types: mainly, adjust raw p-values
Two main error rate extensions
• Family-Wise Error Rate (FWER)
  – FWER is the probability of at least one false positive
    FWER = Pr(# of false discoveries > 0) = Pr(V > 0)
• False Discovery Rate (FDR)
  – FDR is the expected value of the proportion of false positives among rejected null hypotheses
    FDR = E[V/R; R>0] = E[V/R | R>0]·P[R>0]
FDR and FWER controlling procedures
• FWER
  – Bonferroni (adjusted p-value = min{n·p-value, 1})
  – Holm (1979)
  – Hochberg (1988)
  – Westfall & Young (1993) maxT and minP
• FDR
  – Benjamini & Hochberg (1995)
  – Benjamini & Yekutieli (2001)
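Three of these adjustments are short enough to sketch directly. The code below is one plain-Python version of the Bonferroni, Holm and Benjamini & Hochberg adjusted p-values; the function names and the four example p-values are made up, and resampling-based methods such as maxT and minP are not covered.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply every p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiplier decreases from m to 1; the running max keeps monotonicity.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg step-up adjustment (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # Start from the largest p-value; the running min keeps monotonicity.
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print("Bonferroni:", bonferroni(raw))
    print("Holm      :", holm(raw))
    print("BH        :", benjamini_hochberg(raw))
```

For the same raw p-values the FDR-adjusted values are never larger than the FWER-adjusted ones, which is why FDR control usually selects more genes.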
Difference between controlling FWER or FDR
• FWER controls the probability of any (one or more) false positives
  – gives many fewer selected genes (and fewer false positives),
  – but you are likely to miss many truly affected genes
  – adequate if the goal is to identify a few genes that differ between two groups
• FDR controls the expected proportion of false positives among the selected genes
  – if you can tolerate more false positives
  – you will get many fewer false negatives
  – adequate if the goal is to pursue the study, e.g. to determine functional relationships among genes
Steps to generate a list of candidate genes revisited (2)
Gene 1: M11, M12, …, M1k
Gene 2: M21, M22, …, M2k
…
Gene G: MG1, MG2, …, MGk
→ For every gene, calculate Si = t(Mi1, Mi2, …, Mik), e.g. t-statistics, S, B, …
→ Statistics of interest: S1, S2, …, SG
→ Nominal p-values: P1, P2, …, PG (assumption on the null distribution: data normality)
→ Adjusted p-values: aP1, aP2, …, aPG
→ Select genes with adjusted p-values smaller than α
→ A list of candidate DE genes
Example
Golub data, 27 ALL vs 11 AML samples, 3051 genes
Bonferroni adjustment: 98 genes with p_adj < 0.05 (p_raw < 0.000016)
Extensions
• Some issues we have not dealt with
  – Replicates within and between slides
  – Several effects: use a linear model
  – ANOVA: are the effects equal?
  – Time series: selecting genes for trends
• Different solutions have been suggested for each problem
  – Still many open questions
Examples
Ex. 1: Swirl zebrafish experiment
• Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis
• As a result, some cell types are reduced and others are expanded
• A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to wild-type zebrafish
Example 1: Experimental design
• Each microarray contained 8848 cDNA probes (either genes or EST sequences)
• 4 replicate slides: 2 sets of dye-swap pairs
• For each pair, target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3, and the target cDNA of the wild type was labeled using the other dye
[Diagram: Wild type and Swirl, with 2 slides in each dye orientation]
Example 1. Data analysis
• Gene expression data on 8848 genes for 4 samples (slides), each hybridized with mutant and wild type
• On a gene-by-gene basis this is a one-sample problem
• Hypothesis to be tested for each gene: H0: log2(R/G) = 0
• The decision will be based on average log-ratios
Example 2. Scavenger receptor BI (SR-BI) experiment
• Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.
• Transgenic mice with the SR-BI gene overexpressed have low HDL cholesterol levels.
• Goal: to identify genes with altered expression in the livers of transgenic mice overexpressing the SR-BI gene (T) compared to “normal” FVB control mice (C).
Example 2. Experimental design
• 8 treatment mice (Ti) and 8 control mice (Ci).
• 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
• Probes: ~6,000 cDNAs (genes), including 200 related to pathogenicity.
[Diagram: T (8 slides) and C (8 slides), each hybridized against the pooled control C*]
Example 2. Data analysis
• Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C*)) and 8 for control (log(C/C*))
• On a gene-by-gene basis this is a two-sample problem
• Hypothesis to be tested for each gene: H0: [log(R1/G) – log(R2/G)] = 0
• The decision will be based on the average difference of log ratios
Software for microarray data analysis
Introduction
• Microarray experiments generate huge quantities of data which have to be
  – stored, managed, visualized, processed …
• Many options are available. However…
• No tool satisfies all users’ needs
• Trade-off. A tool must be
  – Powerful but user-friendly
  – Complete but without too many options
  – Flexible but easy to start with and go further
  – Available, up to date, well documented, but affordable
So, what you need is “R”?
• R is an open-source system for statistical computation and graphics. It consists of
  – A language
  – A run-time environment with graphics, a debugger, and access to certain system functions
• It can be used
  – Interactively, through a command language
  – Or by running programs stored in script files
http://www.r-project.org/
Some pros & cons
Pros
• Powerful
• Used by statisticians
• Easy to extend
  – Creating add-on packages
  – Many already available
• Freely available
• Unix, Windows & Mac
• Lots of documentation
Cons
• Not very easy to learn
• Command-based
• Documentation sometimes cryptic
• Memory intensive
  – Worse on Windows
• Slow at times
We believe it is worth the effort!
• If you “just want to do statistical analysis”: easy to find alternatives
• If you intend to do microarray data analysis: probably one of the best options
R and Microarrays
• R is a popular tool among statisticians
• Once they started to work with microarrays they continued using it
  – To perform the analysis
  – To implement new tools
• This quickly gave rise to lots of free R-based software to analyze microarrays
• The Bioconductor project groups many (but not all) of these developments
The Bioconductor project
• Open source and open development software project for the analysis and comprehension of genomic data.
• Most early developments as R packages.
• Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html
• Has reached some stability but is still evolving: what is now a standard may not be so in the future.
There's much more than R!
• Have a look at "My microarray software comparison": http://ihome.cuhk.edu.hk/~b400559/arraysoft.html