Quick R Tips
• How to find out what packages are installed locally
– library()
• How to find out what packages are currently loaded
(attached) in your session
– (.packages())
[Figure: analysis workflow]
Biological question (differentially expressed genes, sample class prediction, etc.)
→ Experimental design
→ Microarray experiment
→ Image analysis
→ Normalization
→ Estimation / Testing / Clustering / Discrimination
→ Biological verification and interpretation
Microarray Data Flow
[Figure: flow diagram with the following stages]
• Microarray experiment
• Image Analysis
• Database
• Data Selection & Missing value estimation
• Normalization & Centering
• Unsupervised Analysis – clustering
• Supervised Analysis
• Networks & Data Integration
• Data Matrix Decomposition techniques
A note about Affymetrix (1-color) pre-processing
[Figure: pre-processing stages: within-chip and cross-chip processing; sequence-specific background correction; within-probe set aggregation of intensity values]
Two “standard” methods
– MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes)
– RMA by Speed group (UC Berkeley) (ignores MM probes)
Why normalize?
• Microarray data have significant systematic variation both
within arrays and between arrays that is not true biological
variation
• Accurate comparison of genes’ relative expression within
and across conditions requires normalization of effects
• Sources of variation:
– Spatial location on the array
– Dye biases which vary with spot intensity
– Plate origin
– Printing/spotting quality
– Experimenter
Normalization – Thoughts
• There are many different ways to normalize
data
– Global median, LOWESS, LOESS, RMA etc
– By print tip, spatial, etc
• BUT: don’t expect it to fix bad data!
– Won’t make up for lack of replicates
– Won’t make up for horrible slides
#Create a boxplot of the normalized data
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
#To save the boxplot as a jpeg file
jpeg("normal_boxplot.jpg")
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()
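The boxplot above assumes mydata already holds normalized values. As a minimal sketch of one of the simpler options listed earlier (global median normalization), the block below centres every array at the same median. The data are simulated and the column names are made up for illustration; only the object name mydata follows the snippet above.

```r
# Toy data: first column holds gene IDs, remaining columns hold log2 intensities
# (simulated values; 'mydata' mirrors the object name used above).
set.seed(1)
mydata <- data.frame(
  ID     = paste0("g", 1:100),
  array1 = rnorm(100, mean = 8),
  array2 = rnorm(100, mean = 9),    # deliberately shifted to mimic an array effect
  array3 = rnorm(100, mean = 7.5)
)

# Global median normalization: subtract each array's median so that
# all arrays are centred at zero on the log scale.
intens     <- as.matrix(mydata[-1])
medians    <- apply(intens, 2, median)
normalized <- sweep(intens, 2, medians, FUN = "-")

# After normalization every array has (numerically) the same median.
print(round(apply(normalized, 2, median), 10))
```

This only removes a constant shift per array; intensity-dependent effects such as dye bias would need something like LOWESS instead.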
Microarray Data Analysis
(slides used with permission of Dr. John Quackenbush,
Dana Farber, creator of MeV software)
http://www.tm4.org/mev/
Classification:
• Hierarchical clustering
• K-means clustering
Coherence:
• PCA
• Relevance Network
Differential gene expression:
• T-test
• Analysis of Variance
• Significance Analysis of Microarrays (SAM)
Hierarchical Clustering
• A type of cluster analysis
• HC can be either “divisive” or “agglomerative”;
agglomerative is most commonly used
• Group objects that are “close” to one another
based on some distance/similarity metric
• Clusters are created and linked based on a
metric that evaluates the cluster-to-cluster
distance
• Results are displayed as a dendrogram
Hierarchical Clustering (HCL-2)
[Figure: leaf order of the tree as genes are joined]
Start: g1 g2 g3 g4 g5 g6 g7 g8
g1 is most like g8: g1 g8 g2 g3 g4 g5 g6 g7
g4 is most like {g1, g8}: g1 g8 g4 g2 g3 g5 g6 g7
Hierarchical Clustering (HCL-3)
g5 is most like g7: g1 g8 g4 g2 g3 g5 g7 g6
{g5, g7} is most like {g1, g4, g8}: g1 g8 g4 g5 g7 g2 g3 g6
Hierarchical Tree (HCL-4)
Final leaf order: g1 g8 g4 g5 g7 g2 g3 g6
Hierarchical Clustering
During construction of the hierarchy, decisions must be
made to determine which clusters should be joined.
The distance or similarity between clusters must be
calculated. The rules that govern this calculation are
linkage methods.
(HCL-5)
Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that
can be used to determine which elements (clusters) should be
linked.
Three linkage methods that are commonly used are:
• Single Linkage
• Average Linkage
• Complete Linkage
(HCL-6)
Comparison of Linkage Methods
[Figure: clustering results under single, average, and complete linkage]
(HCL-10)
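To make the difference between the three linkage rules concrete, here is a small sketch using base R's hclust() on five made-up points on a line. The data and variable names are illustrative assumptions; the point is only that the same distances give different cluster-to-cluster heights depending on the rule.

```r
# Five points on a line (made-up data) so the distances are easy to check by hand.
x <- c(a = 0, b = 1, c = 3, d = 7, e = 8)
d <- dist(x)

# Same distances, three different rules for cluster-to-cluster distance.
hc_single   <- hclust(d, method = "single")    # closest pair of members
hc_average  <- hclust(d, method = "average")   # mean over all pairs of members
hc_complete <- hclust(d, method = "complete")  # farthest pair of members

# Height of the final merge = distance between the last two clusters.
final_heights <- c(
  single   = max(hc_single$height),
  average  = max(hc_average$height),
  complete = max(hc_complete$height)
)
print(final_heights)
```

In all three cases the last merge joins {a, b, c} with {d, e}; only the reported distance between those two clusters changes with the linkage rule.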
Bootstrapping (ST)
Bootstrapping – resampling with replacement
Original expression matrix: Gene 1 to Gene 6 (rows) × Exp 1 to Exp 6 (columns)
Various bootstrapped matrices (by experiments), each keeping Gene 1 to Gene 6 as rows:
• Columns: Exp 2, Exp 2, Exp 3, Exp 4, Exp 4, Exp 4
• Columns: Exp 1, Exp 1, Exp 3, Exp 5, Exp 5, Exp 6
Jackknifing (ST)
Jackknifing – resampling without replacement
Original expression matrix: Gene 1 to Gene 6 (rows) × Exp 1 to Exp 6 (columns)
Various jackknifed matrices (by experiments), each keeping Gene 1 to Gene 6 as rows:
• Columns: Exp 1, Exp 3, Exp 4, Exp 5, Exp 6 (Exp 2 left out)
• Columns: Exp 1, Exp 2, Exp 3, Exp 4, Exp 6 (Exp 5 left out)
Analysis of Bootstrapped and Jackknifed Support Trees
• Bootstrapped or jackknifed expression matrices are created many times by
randomly resampling the original expression matrix, using either the bootstrap
or jackknife procedure.
• Each time, hierarchical trees are created from the resampled matrices.
• The trees are compared to the tree obtained from the original data set.
• The more frequently a given cluster from the original tree is found in the
resampled trees, the stronger the support for the cluster.
• As each resampled matrix lacks some of the original data, high support for a
cluster means that the clustering is not biased by a small subset of the data.
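The support-tree idea above can be sketched in a few lines of base R. This is a simplified illustration on simulated data, not the MeV implementation: it checks support for a single cluster (the one containing Gene1 when the tree is cut into two groups) rather than for every node of the tree. The helper name cluster_of_gene1 and all data are assumptions made for the example.

```r
set.seed(42)
# Simulated expression matrix: 6 genes (rows) x 12 experiments (columns).
# Genes 1-3 share one pattern and genes 4-6 another, plus noise.
pattern_a <- rnorm(12)
pattern_b <- rnorm(12)
expr <- rbind(
  t(replicate(3, pattern_a + rnorm(12, sd = 0.3))),
  t(replicate(3, pattern_b + rnorm(12, sd = 0.3)))
)
rownames(expr) <- paste0("Gene", 1:6)

# Cut the tree into two clusters and return the members of Gene1's cluster.
cluster_of_gene1 <- function(m) {
  groups <- cutree(hclust(dist(m), method = "average"), k = 2)
  sort(names(groups)[groups == groups["Gene1"]])
}

original <- cluster_of_gene1(expr)

# Bootstrap the experiments (columns) with replacement and re-cluster each time.
n_boot <- 200
hits <- replicate(n_boot, {
  cols <- sample(ncol(expr), replace = TRUE)
  identical(cluster_of_gene1(expr[, cols, drop = FALSE]), original)
})

# Support = fraction of resampled trees that reproduce the original cluster.
support <- mean(hits)
print(original)
print(support)
```

For a jackknife version, the resampling line could instead drop one column at a time, e.g. cols <- setdiff(seq_len(ncol(expr)), i) for each experiment i.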
Hierarchical Clustering in R
Step 1: Data matrix
• First you need a numeric matrix
– A typical array data set will have samples as columns and
genes as rows
– We want to be sure our data are in the form of an expression
matrix
• Use Biobase library/package
• See
http://www.bioconductor.org/packages/2.2/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf
# The header/sep/row.names arguments belong to the file-reading step, not to as.matrix().
# "mydata.txt" is a placeholder file name.
> data <- read.delim("mydata.txt", header=TRUE, sep="\t", row.names=1, as.is=TRUE)
> exprs <- as.matrix(data)
Step 2: Calculate Distance Matrix
• The default dist() method in R uses rows as the vectors, but we want
the distance between samples, i.e., the columns of our matrix.
• A handy set of packages from MD Anderson, built around
oompaBase, helps with this
source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()
oompainstall(groupName="all")
• Once installed, be sure to load the libraries in your session
library(oompaBase)
library(ClassDiscovery)
library(ClassComparison)
• oompaBase also requires the mclust and cobs
packages; download these from CRAN
• Use the function distanceMatrix() to create
a distance matrix of your samples
– Uses the expression matrix created in Step 1 as
input
– Remember that there are many different types
of distance metrics to choose from!
– See help(distanceMatrix)
x <- distanceMatrix(exprs, 'pearson')
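If the OOMPA packages are not available, a sample-by-sample distance can also be built with base R alone. The sketch below uses simulated data and shows two options; note that the 1 - r dissimilarity here is an illustrative choice and is not guaranteed to match the scaling that distanceMatrix() uses for 'pearson'.

```r
set.seed(7)
# Simulated expression matrix: 50 genes (rows) x 4 samples (columns).
exprs <- matrix(rnorm(200), nrow = 50,
                dimnames = list(paste0("gene", 1:50), paste0("sample", 1:4)))

# dist() compares rows, so transpose to get Euclidean distances between samples.
d_euclid <- dist(t(exprs))

# cor() already works column-wise; 1 - r turns a correlation into a dissimilarity.
d_pearson <- as.dist(1 - cor(exprs, method = "pearson"))

print(d_pearson)
```

Either object can be passed to hclust() in the same way as the distance matrix x above.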
Step 3: Cluster
• Use the hclust() function to create a hierarchical cluster
based on your distance matrix, x, created in Step 2.
> y<-hclust(x,method="complete")
> plot(y)
Testing for Differential Gene
Expression with the T-test
• Get the multtest package from Bioconductor
• Package contains data from the Golub
leukemia microarray data set (ALL v AML)
– 38 arrays
• 27 from lymphoblastic
• 11 from myeloid
http://people.cryst.bbk.ac.uk/wernisch/maco
• library(multtest)
• data(golub)
• golub.cl
• Generate the T statistic
– teststat <- mt.teststat(golub, golub.cl)
• Convert into P-values
– rawp0 <- 2*pt(abs(teststat), lower.tail=FALSE, df=38-2)
• Correct for multiple testing and show the ten most
significant genes
– procs <- c("Bonferroni", "BH")
– res <- mt.rawp2adjp(rawp0, procs)
– res$adjp[1:10,]
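The same workflow can be sketched without multtest, using only base R. The block below is an illustration on simulated data with the same 27/11 class split as the Golub set; it uses t.test() (Welch's test by default) and p.adjust() in place of mt.teststat() and mt.rawp2adjp(), so the numbers will not match the multtest output.

```r
set.seed(123)
# Simulated stand-in for the Golub data: 200 genes x 38 arrays,
# 27 in class 0 and 11 in class 1.
cl <- c(rep(0, 27), rep(1, 11))
x  <- matrix(rnorm(200 * 38), nrow = 200)
x[1:10, cl == 1] <- x[1:10, cl == 1] + 3   # first 10 genes truly differ

# Welch t-test per gene (row); t.test() supplies the p-value directly.
rawp <- apply(x, 1, function(g) t.test(g[cl == 1], g[cl == 0])$p.value)

# Bonferroni and Benjamini-Hochberg adjustments with base R's p.adjust().
adjp <- cbind(rawp       = rawp,
              Bonferroni = p.adjust(rawp, method = "bonferroni"),
              BH         = p.adjust(rawp, method = "BH"))

# Ten most significant genes, ordered by raw p-value.
top10 <- order(rawp)[1:10]
print(adjp[top10, ])
```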
K-Means / K-Medians Clustering (KMC)– 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
[Figure: genes G1 to G13 randomly assigned to clusters]
K-Means Clustering – 2
3. Calculate mean / median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster
whose mean / median expression profile (calculated in step 3) is the
closest to that gene’s expression profile.
[Figure: genes G1 to G13 regrouped according to the closest cluster profile]
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more,
OR a user-specified number of iterations has been reached.
K-Means / K-Medians is most useful when the user has an a-priori hypothesis
about the number of clusters the genes should group into.
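The steps above correspond closely to what base R's kmeans() does. The sketch below runs it on simulated data for 13 genes; the profiles and names are made up. One difference to be aware of: kmeans() starts from randomly chosen cluster centres rather than a random assignment of genes, but the iterate-until-stable loop of steps 3 to 5 is the same idea.

```r
set.seed(10)
# Simulated data: 13 genes x 6 experiments, built from 3 underlying profiles.
profiles <- rbind(c( 2,  2,  2, -2, -2, -2),
                  c(-2, -2, -2,  2,  2,  2),
                  c( 0,  3,  0,  3,  0,  3))
truth <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
genes <- profiles[truth, ] + matrix(rnorm(13 * 6, sd = 0.3), nrow = 13)
rownames(genes) <- paste0("G", 1:13)

# k is chosen up front (step 1). nstart repeats the random start several times
# and keeps the best solution, since a single random start can get stuck.
fit <- kmeans(genes, centers = 3, nstart = 25, iter.max = 50)

print(fit$cluster)             # cluster label for each gene
print(round(fit$centers, 2))   # mean expression profile of each cluster
```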
Principal Components (PCAG and PCAE) – 1
1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple
experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a
weighted average of some of the experiments, thus possibly making
the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional
space, where n is the number of components. Each axis represents a
different trend in the data.
PCAG and PCAE - 2
[Figure: “cloud” of data points (e.g., genes) in 3-dimensional space, resolved along 3 principal component axes x, y and z]
In this example:
• The x-axis could mean a continuum from over- to under-expression (“blue” and “green”
genes over-expressed, “yellow” genes under-expressed).
• The y-axis could mean that “gray” genes are over-expressed in the first five expts and
under-expressed in the remaining expts, while “brown” genes are under-expressed in the
first five expts and over-expressed in the remaining expts.
• The z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in
odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true
for “purple” genes.
Interpretation of components is somewhat subjective.
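As a rough illustration of how correlated experiments collapse into a few components, the sketch below runs base R's prcomp() on simulated data in which experiments 1 to 3 and experiments 4 to 6 each follow a shared signal. All data and names are assumptions for the example.

```r
set.seed(5)
# Simulated data: 100 genes x 6 experiments. Experiments 1-3 are noisy copies
# of one signal and experiments 4-6 of another, so they are correlated.
s1 <- rnorm(100)
s2 <- rnorm(100)
expr <- cbind(s1 + rnorm(100, sd = 0.2), s1 + rnorm(100, sd = 0.2),
              s1 + rnorm(100, sd = 0.2), s2 + rnorm(100, sd = 0.2),
              s2 + rnorm(100, sd = 0.2), s2 + rnorm(100, sd = 0.2))
colnames(expr) <- paste0("Exp", 1:6)

# Genes are the observations (rows); experiments are the variables (columns).
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: the weight each experiment contributes to each component.
print(round(pca$rotation[, 1:2], 2))
```

With six experiments driven by only two underlying signals, the first two components should carry most of the variance, which is the sense in which PCA "ignores" redundant experiments.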
Relevance Networks
Set of genes whose
expression profiles are
predictive of one another.
Can be used to identify
negative correlations
between genes
Genes with low entropy
(least variable across experiments)
are excluded from analysis.
H = -\sum_{x=1}^{10} p(x) \log_2 p(x)
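The entropy filter can be sketched as follows. The slide does not define p(x), so this example assumes that a gene's expression values are sorted into 10 equal-width bins over a common range and that p(x) is the fraction of values falling into bin x. The function name gene_entropy, the limits argument and the data are all hypothetical.

```r
# Entropy of one gene's expression values, using n_bins equal-width bins
# between limits[1] and limits[2] (the binning scheme is an assumption).
gene_entropy <- function(values, limits, n_bins = 10) {
  breaks <- seq(limits[1], limits[2], length.out = n_bins + 1)
  counts <- table(cut(values, breaks = breaks, include.lowest = TRUE))
  p <- as.numeric(counts) / length(values)
  p <- p[p > 0]                  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

limits <- c(0, 10)                           # assumed common expression range
flat_gene     <- rep(5, 20)                  # identical in every experiment
variable_gene <- seq(0.25, 9.75, by = 0.5)   # 20 values spread over the range

h_flat <- gene_entropy(flat_gene, limits)
h_var  <- gene_entropy(variable_gene, limits)
print(c(flat = h_flat, variable = h_var))
```

A gene whose values all land in one bin has zero entropy and would be filtered out, while a gene spread evenly over all 10 bins reaches the maximum of log2(10).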
Relevance Networks
[Figure: five genes A to E joined by edges labelled with correlation coefficients (.92, .75, .15, .37, .02, .28, .63, .51, .40, .11), shown before and after thresholding]
1. The expression pattern of each gene is compared to that of every other gene.
2. The ability of each gene to predict the expression of each other gene is
assigned a correlation coefficient.
3. Correlation coefficients outside the boundaries defined by the minimum and
maximum thresholds (here Tmin = 0.50 and Tmax = 0.90) are eliminated.
4. The remaining relationships between genes define the subnets.
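The thresholding step can be illustrated with base R on simulated data. This sketch uses Pearson correlation and applies the thresholds to the absolute value of r so that negative correlations can also form links; both choices are assumptions for the example rather than a description of the MeV implementation.

```r
set.seed(3)
# Simulated data: 5 genes (A-E) measured in 30 experiments.
base <- rnorm(30)
expr <- rbind(A = base,
              B = base + rnorm(30, sd = 0.8),    # positively related to A
              C = -base + rnorm(30, sd = 0.8),   # negatively related to A
              D = rnorm(30),
              E = rnorm(30))

t_min <- 0.50
t_max <- 0.90

# Gene-by-gene Pearson correlations (cor() works on columns, hence t()).
r <- cor(t(expr))

# Keep a link only if |r| lies within [t_min, t_max].
keep <- abs(r) >= t_min & abs(r) <= t_max
keep[lower.tri(keep, diag = TRUE)] <- FALSE   # count each pair once, no self-links

idx <- which(keep, arr.ind = TRUE)
edges <- data.frame(gene1 = rownames(r)[idx[, "row"]],
                    gene2 = colnames(r)[idx[, "col"]],
                    r     = round(r[idx], 2))
print(edges)
```

The rows of edges are the surviving links; genes connected through these links form the subnets.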
T-Tests (TTEST) – Between subjects (or unpaired) - 1
1. Assign experiments to two groups, e.g., in the expression matrix
below, assign Experiments 1, 2 and 5 to group A, and
experiments 3, 4 and 6 to group B.
Expression matrix: Gene 1 to Gene 6 (rows) × Exp 1 to Exp 6 (columns)
Group A: Exp 1, Exp 2, Exp 5
Group B: Exp 3, Exp 4, Exp 6
2. Question: Is mean expression level of a gene in group A
significantly different from mean expression level in group B?
TTEST – Between subjects - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic for each gene either
from:
A. Theoretical t-distribution
OR
B. Permutation tests.
TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene between groups A and B,
such that the reshuffled groups A and B respectively have the same
number of elements as the original groups A and B.
Gene 1, original grouping: Group A = Exp 1, Exp 2, Exp 5; Group B = Exp 3, Exp 4, Exp 6
Gene 1, randomized grouping: Group A = Exp 3, Exp 2, Exp 6; Group B = Exp 4, Exp 5, Exp 1
TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized gene
iv) Repeat steps ii-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original
t-statistic exceeds the absolute values of the randomized t-statistic
over n randomizations.
vi) Then, the p-value associated with the gene = 1 – (x/n)
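The permutation procedure in steps i to vi can be written out directly in base R. The sketch below follows those steps for a single gene with made-up values; the function names t_stat and perm_p_value are hypothetical, and t.test() (Welch's test by default) stands in for whichever t-statistic the software actually uses.

```r
# Two-sample (Welch) t-statistic for one gene.
t_stat <- function(values, in_a) {
  unname(t.test(values[in_a], values[!in_a])$statistic)
}

perm_p_value <- function(values, in_a, n = 1000) {
  t_orig <- abs(t_stat(values, in_a))     # step i
  t_rand <- replicate(n, {
    shuffled <- sample(in_a)              # step ii: group sizes are preserved
    abs(t_stat(values, shuffled))         # step iii
  })                                      # step iv: n repetitions
  x <- sum(t_orig > t_rand)               # step v
  1 - x / n                               # step vi
}

set.seed(99)
# Made-up expression values for one gene in Exp 1..Exp 6;
# Exp 1, 2, 5 form group A and Exp 3, 4, 6 form group B, as in the slides.
gene1 <- c(5.1, 4.8, 9.7, 10.2, 5.3, 9.9)
in_a  <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

p <- perm_p_value(gene1, in_a, n = 500)
print(p)
```

With only six experiments split 3 versus 3 there are just 20 distinct groupings, and two of them (the original and its mirror image) give the same absolute t-statistic. Even a perfectly separated gene therefore cannot reach a permutation p-value much below 0.1, which is one reason replicates matter.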
TTEST - Between subjects - 5
5. Determine whether a gene’s expression levels are significantly
different between the two groups by one of three methods:
A) Just alpha: If the calculated p-value for a gene is less than
or equal to the user-input alpha (critical p-value), the gene is
considered significant.
OR
Use Bonferroni corrections to reduce the probability of
erroneously classifying non-significant genes as significant.
B) Standard Bonferroni correction: The user-input alpha is divided
by the total number of genes to give a critical p-value that is used
as above.
TTEST - Between subjects – 6
C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending
order.
ii) For the gene with the highest t-value, the critical p-value
becomes alpha / N, where N is the total number of genes; for the
gene with the second-highest t-value, the critical p-value will be
alpha / (N-1), and so on.
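The three decision rules can be compared side by side on a handful of made-up p-values. This is a literal reading of the description above, not the MeV code, and it assumes that ranking genes by t-value is equivalent to ranking them by p-value (smallest p first).

```r
alpha <- 0.05
# Made-up p-values for N = 5 genes.
p <- c(g1 = 0.001, g2 = 0.011, g3 = 0.02, g4 = 0.04, g5 = 0.30)
N <- length(p)

# A) Just alpha: compare every gene with alpha itself.
sig_alpha <- p <= alpha

# B) Standard Bonferroni: one critical p-value, alpha / N, for every gene.
sig_bonf <- p <= alpha / N

# C) Adjusted Bonferroni as described above: the critical p-value is
# alpha / N for rank 1, alpha / (N - 1) for rank 2, and so on.
rank_sig <- rank(p, ties.method = "first")   # rank 1 = most significant gene
critical <- alpha / (N - rank_sig + 1)
sig_adj  <- p <= critical

print(rbind(sig_alpha, sig_bonf, sig_adj))
```

The adjusted scheme resembles Holm's step-down method, available in base R as p.adjust(p, method = "holm"); Holm's method additionally stops declaring genes significant after the first one that fails its threshold.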
The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics,
Dept. Computational Molecular Biology, Berlin, Germany
http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)
• Let’s imagine there are 10,000 genes on a chip, AND
• None of them is differentially expressed.
• Suppose we use a statistical test for differential
expression, where we consider a gene to be
differentially expressed if it meets the criterion at a
p-value of p < 0.05.
The problem of multiple testing – 2
• Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01
• Remember that a p-value of 0.01 means that, if the gene were not
differentially expressed, there would be only a 1% chance of seeing a
result at least this extreme, i.e.,
• When we conclude that the gene is differentially
expressed (because p < 0.05), we accept a small chance that
our conclusion is wrong.
• We might be willing to live with such a low probability
of being wrong
BUT .....
The problem of multiple testing – 3
• We are testing 10,000 genes, not just one!!!
• Even though none of the genes is differentially
expressed, about 5% of the genes (i.e., 500 genes) will be
erroneously concluded to be differentially expressed,
because we have decided to “live with” a p-value of 0.05
• If only one gene were being studied, a 5% margin of
error might not be a big deal, but 500 false conclusions
in one study? That doesn’t sound too good.
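The thought experiment above is easy to simulate. The sketch below generates 10,000 genes with no real difference between two groups of three arrays, tests each gene at p < 0.05, and counts the false positives. Group sizes, the seed and the use of an equal-variance t-test are assumptions for the example.

```r
set.seed(2024)
n_genes <- 10000
# Six arrays per gene, 3 per group, all drawn from the same distribution,
# so no gene is truly differentially expressed.
x <- matrix(rnorm(n_genes * 6), nrow = n_genes)

# Equal-variance two-sample t-test for every gene.
p <- apply(x, 1, function(g) t.test(g[1:3], g[4:6], var.equal = TRUE)$p.value)

# Genes wrongly called significant; expected to be near 0.05 * 10000 = 500.
false_positives <- sum(p < 0.05)
print(false_positives)

# The same count after a Bonferroni adjustment of the p-values.
bonf_positives <- sum(p.adjust(p, method = "bonferroni") < 0.05)
print(bonf_positives)
```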
The problem of multiple testing - 4
• There are “tricks” we can use to reduce the severity of
this problem.
• They all involve “slashing” the p-value for each test
(i.e., gene), so that while the critical p-value for the entire
data set might still equal 0.05, each gene will be
evaluated at a lower p-value.
• Don’t get too hung up on p-values.
• Ultimately, what matters is biological relevance.
P-values should help you evaluate the strength of the
evidence, rather than being used as an absolute yardstick
of significance. Statistical significance is not necessarily
the same as biological significance.
Significance analysis of microarrays (SAM)
• SAM can be used to pick out significant genes
based on differential expression between sets
of samples.
SAM -2
• SAM gives estimates of the False Discovery Rate
(FDR), which is the proportion of genes likely to have
been wrongly identified by chance as being significant.
• It is a very interactive algorithm – allows users to
dynamically change thresholds for significance
(through the tuning parameter delta) after looking at
the distribution of the test statistic.
• The ability to dynamically alter the input parameters
based on immediate visual feedback, even before
completing the analysis, should make the data-mining
process more sensitive.
SAM designs
Two-class unpaired: to pick out genes whose mean
expression level is significantly different
between two groups of samples (analogous to
between subjects t-test).
Two-class paired: samples are split into two
groups, and there is a 1-to-1 correspondence
between a sample in group A and one in group
B (analogous to paired t-test).
SAM designs - 2
Multi-class: picks up genes whose mean expression is
different across > 2 groups of samples (analogous to
one-way ANOVA)
Censored survival: picks up genes whose expression
levels are correlated with duration of survival.
One-class: picks up genes whose mean expression
across experiments is different from a user-specified
mean.
SAM Two-Class Unpaired – 4
[Figure: SAM plot of observed d against expected d, with the following annotations]
• “Observed d = expected d” line
• Significant positive genes (i.e., mean expression of group B >
mean expression of group A) in red
• Significant negative genes (i.e., mean expression of group A >
mean expression of group B) in green
• Tuning parameter “delta” limits, which can be dynamically changed by
using the slider bar or entering a value in the text field
The more a gene deviates from the “observed = expected” line, the
more likely it is to be significant. Any gene beyond the first gene in the
+ve or –ve direction on the x-axis (including the first gene), whose
observed exceeds the expected by at least delta, is considered
significant.