Analysis of Differential Expression
T-test
ANOVA
Non-parametric methods
Correlation
Regression
Research Question

Do nicotine-exposed rats have different X gene expression than control rats in the ventral tegmental area (VTA)?




Design an experiment in which treatment rats
(N>2) are exposed to nicotine and control rats
(N>2) are exposed to saline.
Collect RNA from VTA, convert to cDNA
Determine the amount of X transcript in each
individual.
Perform a test of means considering the
variability within each group.
Observed difference between
groups

May be due to


Treatment
Chance
Hypothesis Testing


Null hypothesis: There is no difference
between the means of the groups.
Alternative hypothesis: Means of the
groups are different.
Hypothesis testing

You cannot accept the null hypothesis.


You can reject it.
You can fail to reject it: the data may support it, but they never prove it.
P-value

The 'P' stands for probability. The P value is the probability of observing a difference between groups at least as large as the one seen if chance alone were at work (that is, if the null hypothesis were true).
P-value




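The difference between groups is called statistically significant if the P value is small enough (e.g., < 0.05).
The significance threshold (e.g., 0.05), not the P value itself, is the probability of a type I error when the null hypothesis is true.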
Type I error: wrongly concluding that there
is a difference between groups (false
positive).
Type II error: wrongly concluding that there
is no difference between groups (false
negative).
Multiple tests on the same
data



Expression data on multiple genes
from the same individuals
Subsets of genes are coregulated thus
they are not independent.
Such data requires multiple tests.
Why not do multiple t-tests? Or if
you do, adjust the p-values





Because it increases the type I error rate:
In a study involving four treatments, there are six possible pairwise comparisons.
If the chance of a type I error in one such comparison is 0.05, then the chance of not committing a type I error is 1 - 0.05 = 0.95.
If the six comparisons are independent, the chance of not committing a type I error in any of them is 0.95^6 = 0.74.
Cumulative type I error = 1 - 0.74 = 0.26
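The arithmetic above can be checked with a short sketch. It assumes the comparisons are independent, which is the simplification the calculation relies on; the function name `familywise_error` is made up for this illustration.

```python
from math import comb


def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_treatments = 4
n_pairs = comb(n_treatments, 2)  # number of pairwise comparisons: 6
print(n_pairs, round(familywise_error(n_pairs), 2))  # prints: 6 0.26
```

With 20 independent tests at the same threshold the same function gives about 0.64, which is why unadjusted multiple testing is discouraged.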
Normal Distribution

It is entirely defined by two quantities: its mean and its standard deviation (SD).


The mean determines where the peak occurs, and the SD determines the spread (width) of the curve.
Curves: same mean, different
stds
Rules of normal distribution



68.3% of the distribution falls within 1
SD of the mean (i.e. between mean –
SD and mean + SD);
95.4% of the distribution falls between
mean – 2 SD and mean + 2 SD;
99.7% of the distribution falls between
mean – 3 SD and mean + 3 SD.
Most commonly used rule


95% of the distribution
falls between mean –
1.96 SD and mean +
1.96 SD
If the data are normally distributed, one can use this rule to give a range (a reference range) within which 95% of the data fall.
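The percentages quoted in these rules can be reproduced from the standard normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `coverage` is an assumption for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k: float) -> float:
    """Fraction of a normal distribution lying within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3, 1.96):
    print(f"within {k} SD of the mean: {coverage(k):.1%}")

# Inverse question: how many SDs enclose the central 95% of the distribution?
z_95 = std_normal.inv_cdf(0.975)
print(round(z_95, 2))  # prints: 1.96
```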
A sample




Samples vary
Samples are collected in limited
numbers
They are meant to be representative of a population.
A sample:

E.g., nicotine treated rat RNA
Sample means



Consider all possible samples of fixed
size (n) drawn from a population.
Each of these samples has its own
mean and these means will vary
between samples.
Each sample also has its own distribution, and thus its own SD.
Population mean


The mean of all the sample means is equal to the population mean (μ).
The SD of the sample means measures the deviation of individual sample means from the population mean (μ).
Standard error

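The standard error (SE) is the SD of the sample means; from a single sample it is estimated as SD / sqrt(n).
It reflects the effect of sample size: a large SE means that either the variation is high or the sample size is small.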
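A small simulation can illustrate these statements about sample means and the standard error. Everything here is hypothetical: the population mean of 100, the SD of 15, the sample size of 25, and the number of repeated samples are chosen only for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

POP_MEAN, POP_SD = 100.0, 15.0  # hypothetical population
n = 25                          # size of each sample
n_samples = 2000                # number of repeated samples

# Draw many samples of size n and record the mean of each one.
sample_means = [
    mean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = mean(sample_means)   # should be close to POP_MEAN
observed_se = stdev(sample_means)    # SD of the sample means
theoretical_se = POP_SD / sqrt(n)    # 15 / 5 = 3

print(round(mean_of_means, 1), round(observed_se, 2), theoretical_se)
```

Increasing `n` shrinks both the observed and the theoretical SE, while increasing `POP_SD` inflates them.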
Confidence Intervals


A confidence interval gives a range of values within which the true population value is likely to lie.
It is defined as follows:


The 95% confidence interval runs from (sample mean - 1.96 SE) to (sample mean + 1.96 SE).
A 99% confidence interval is calculated as mean ± 2.58 SE.
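As a rough sketch, the interval can be computed from a sample as follows. The function name `normal_ci` and the five data values are made up; the multiplier comes from the normal distribution, which is a reasonable assumption only for large samples (for small samples the t-distribution is the usual choice).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_ci(values, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # 1.96 for 95%, 2.58 for 99%
    m = mean(values)
    se = stdev(values) / sqrt(len(values))     # standard error of the mean
    return m - z * se, m + z * se


expression = [1, 2, 3, 4, 5]  # hypothetical measurements
low, high = normal_ci(expression)
print(round(low, 2), round(high, 2))  # prints: 1.61 4.39
```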
T-distribution


The t-distribution is similar in shape to the
Normal distribution, being symmetrical and
unimodal, but is generally more spread out
with longer tails.
The exact shape depends on a quantity
known as the ‘degrees of freedom’, which in
this context is equal to

the sample size minus 1.
One-sample t-test





Null hypothesis: the population mean does not differ from a hypothesized mean, e.g., 0 (H0: μ = 0).
A t statistic (t) is calculated.
t is the number of SEs that separate the sample mean from the hypothesized value.
The associated P value is obtained by comparison with the t distribution.
The larger the absolute value of t, the lower the probability of obtaining such a value under the null hypothesis, so the P value is smaller and the result more significant.
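A minimal sketch of the t statistic described here, using made-up data and a hypothetical helper `one_sample_t`. It returns only t and the degrees of freedom; the P value would still have to be read from a t table or obtained from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, hypothesized_mean=0.0):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    se = stdev(values) / sqrt(n)                # standard error of the mean
    t = (mean(values) - hypothesized_mean) / se  # number of SEs from the hypothesized value
    return t, n - 1


t, df = one_sample_t([1, 2, 3, 4, 5], hypothesized_mean=0.0)  # hypothetical data
print(round(t, 2), df)  # prints: 4.24 4
```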
Paired t-test


Used with paired data.
Paired data arise in a number of
different situations,


a matched case-control study in which individual cases and controls are matched to each other, or
a repeated measures study in which some measurement is made on the same set of individuals on more than one occasion.
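For paired data the test reduces to a one-sample t-test on the within-pair differences. The sketch below assumes that standard formulation; the function name `paired_t` and the before/after values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, degrees of freedom) for a paired t-test on matched lists."""
    diffs = [b - a for a, b in zip(first, second)]  # one difference per pair
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)
    return mean(diffs) / se, n - 1


before = [10, 12, 9, 11]   # hypothetical first measurements
after = [12, 15, 10, 14]   # hypothetical second measurements on the same individuals
t, df = paired_t(before, after)
print(round(t, 2), df)  # prints: 4.7 3
```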
Two-sample t-test

Comparison of two groups with
unpaired data.


E.g., comparison of individuals of
treatment and those of control for a
particular variable.
There are now two independent samples, and thus two SDs.
Calculation of pooled SD

The pooled SD is calculated from the two sample SDs, each weighted by its degrees of freedom:
pooled SD = sqrt( ((n1 - 1) SD1^2 + (n2 - 1) SD2^2) / (n1 + n2 - 2) )
Calculation of pooled SE

The pooled estimate gives more weight to the larger sample (if sample sizes are unequal) because it is likely to be more reliable. The SE for the difference in means is calculated from the pooled SD as follows:
SE = pooled SD × sqrt(1/n1 + 1/n2)
Two sample T-test

Comparison of the means of two groups based on a t statistic and Student's t distribution.

The t statistic is obtained by dividing the difference between the sample means by the standard error of the difference.
T-statistic


A P value may be obtained by comparison with the t
distribution on n1 + n2 – 2 degrees of freedom.
Again, the larger the t statistic, the smaller the P
value will be.
Example
X-gene expression   Tumor   Control
# of samples        119     117
Mean                81      95
Std                 18      19
Calculation of SD
Calculation of SE
T-statistic


t = (95-81)/2.41 = 14/2.41 = 5.81,
with a corresponding P value less than 0.0001.

Reject the null hypothesis that the group means do not differ.
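The worked example can be reproduced from the summary statistics alone. The sketch below assumes the standard pooled-variance formulas for the SD and SE; the function name `pooled_t` is made up for illustration.

```python
from math import sqrt


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t-test from summary statistics, assuming equal variances."""
    df = n1 + n2 - 2
    # Pooled SD: each group's variance weighted by its degrees of freedom.
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    # Standard error of the difference between the two means.
    se_diff = pooled_sd * sqrt(1 / n1 + 1 / n2)
    t = (mean1 - mean2) / se_diff
    return pooled_sd, se_diff, t, df


# Control: n = 117, mean 95, SD 19; Tumor: n = 119, mean 81, SD 18
pooled_sd, se_diff, t, df = pooled_t(117, 95, 19, 119, 81, 18)
print(round(pooled_sd, 2), round(se_diff, 2), round(t, 2), df)
# prints: 18.5 2.41 5.81 234
```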
Analysis of Variance

ANOVA

A technique for analyzing the way in
which the mean of a variable is affected
by different types and combinations of
factors.

E.g., the effect of three different diets on total
serum cholesterol
Sample Experiment
Variance:
Sum of squares calculations
between
within
total
Degrees of freedom
Sources of variation
A P value of 0.0039 indicates that at least two of the treatment group means differ.
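The sum-of-squares decomposition behind a one-way ANOVA can be sketched as below. The three groups are invented numbers, not the data from the slides, and the helper `one_way_anova` is hypothetical; it stops at the F statistic, which would then be compared with the F distribution on the two degrees of freedom.

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and F for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between groups: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each observation lies from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


diets = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical values for three diets
result = one_way_anova(diets)
print(result)
```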
Multiple Tests



Post hoc comparisons between pairs
of treatments.
The overall type I error rate increases with the number of pairwise comparisons.
One has to maintain the 0.05 type I error rate across all of the comparisons.
Bonferroni Adjustment


Adjusted significance threshold = 0.05 / number of tests
Can be too conservative, especially when the number of tests is large
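A minimal sketch of the Bonferroni adjustment follows. The six P values are made up, and the function name `bonferroni` is an assumption for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold, adjusted P values and significance flags."""
    m = len(p_values)
    threshold = alpha / m                            # e.g. 0.05 / 6
    adjusted = [min(1.0, p * m) for p in p_values]   # equivalent view: inflate each P value
    significant = [p < threshold for p in p_values]
    return threshold, adjusted, significant


p_values = [0.001, 0.02, 0.04, 0.3, 0.008, 0.6]  # hypothetical results of six tests
threshold, adjusted, significant = bonferroni(p_values)
print(round(threshold, 4), significant)
```

In this made-up example, two P values that would pass an unadjusted 0.05 threshold (0.02 and 0.04) no longer count as significant, which shows how conservative the adjustment is.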
NonParametric methods

Many statistical methods require assumptions.



The t-test assumes the data in each group are approximately normally distributed.
Data that violate the assumptions may need to be transformed first.
Nonparametric methods require few or no distributional assumptions.
Wilcoxon signed rank test for
paired data
Wilcoxon signed rank test
Central venous oxygen saturation (SvO2) on admission and 6 h after admission to the ICU.

Patients have SvO2 values on admission and after 6 hours.
Take the difference between the paired data points.
Rank the differences regardless of their sign.
Give each rank the sign of its difference.
Calculate


Sum of positive ranks (R+)
Sum of negative ranks (R-)
Sum of positive and negative
ranks
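The ranking steps of the signed rank test can be sketched as follows. The SvO2 values are invented, the helper names are hypothetical, and dropping zero differences before ranking is a common convention assumed here rather than something the slides state. The smaller of the two sums is the value usually compared with the table of critical values.

```python
def average_ranks(values):
    """Rank values in increasing order; tied values share their average rank."""
    sorted_vals = sorted(values)
    rank_of = {}
    for v in set(values):
        first = sorted_vals.index(v) + 1   # 1-based position of the first copy
        count = sorted_vals.count(v)       # number of tied copies
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def signed_rank_sums(first, second):
    """Return (R+, R-) for paired measurements."""
    diffs = [b - a for a, b in zip(first, second) if b != a]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])            # rank ignoring sign
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus


admission = [60, 55, 70, 65, 50]    # hypothetical SvO2 (%) on admission
after_6h = [65, 54, 78, 65, 53]     # hypothetical SvO2 (%) after 6 h
r_plus, r_minus = signed_rank_sums(admission, after_6h)
print(r_plus, r_minus)  # prints: 9.0 1.0
```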
Critical values for WSR test
when n = 10
5
Wilcoxon rank-sum or Mann-Whitney test


The Wilcoxon signed rank test is suited to paired data.
For unpaired data, the Wilcoxon rank-sum test is used.
Steps of Wilcoxon rank-sum
test
Total drug doses in patients with a 3
to 5 day stay in intensive care unit.




Rank all observations in increasing order regardless of grouping
Use the average rank if values tie
Add up the ranks within each group
Select the smaller rank sum and calculate a P value for it.
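These steps can be sketched as below, using invented dose values and hypothetical helper names. The sketch stops at the rank sums; the P value would come from a table of critical values or a statistics package.

```python
def average_ranks(values):
    """Rank values in increasing order; tied values share their average rank."""
    sorted_vals = sorted(values)
    rank_of = {}
    for v in set(values):
        first = sorted_vals.index(v) + 1   # 1-based position of the first copy
        count = sorted_vals.count(v)       # number of tied copies
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def rank_sums(group_a, group_b):
    """Return the sum of ranks for each group after ranking all values together."""
    combined = list(group_a) + list(group_b)   # ignore grouping while ranking
    ranks = average_ranks(combined)
    sum_a = sum(ranks[: len(group_a)])
    sum_b = sum(ranks[len(group_a):])
    return sum_a, sum_b


doses_a = [3, 5, 8]          # hypothetical total doses, group A
doses_b = [4, 8, 10, 12]     # hypothetical total doses, group B
sum_a, sum_b = rank_sums(doses_a, doses_b)
smaller = min(sum_a, sum_b)  # the statistic to look up
print(sum_a, sum_b, smaller)  # prints: 8.5 19.5 8.5
```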
Critical values
Correlation and Regression



Correlation quantifies the strength of the
relationship between two paired samples.
Regression expresses the relationship in the
form of an equation.
Example: whether two genes, X and Y are
coregulated, or the expression level of gene
X can be predicted based on the expression
level of gene Y.
Product moment correlation
r lies between -1 and +1
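The product moment (Pearson) correlation coefficient can be computed directly from its definition. The sketch below uses made-up paired values and a hypothetical function name `pearson_r`.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product moment correlation coefficient for paired lists x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation of x and y
    sxx = sum((a - mx) ** 2 for a in x)                   # variation in x
    syy = sum((b - my) ** 2 for b in y)                   # variation in y
    return sxy / sqrt(sxx * syy)


gene_x = [1, 2, 3, 4, 5]  # hypothetical expression levels of gene X
gene_y = [2, 4, 5, 4, 5]  # hypothetical expression levels of gene Y
r = pearson_r(gene_x, gene_y)
print(round(r, 2))  # prints: 0.77
```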
Age and urea for 20 patients in
emergency unit
Scattergram
r = 0.62
Confidence intervals around r
Confidence of r
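The transcript does not preserve how the interval around r was obtained. One common approach, assumed here, is Fisher's z transformation: r is transformed with the inverse hyperbolic tangent, a normal-based interval with standard error 1 / sqrt(n - 3) is formed, and the limits are transformed back. The sketch applies it to the r = 0.62 and n = 20 quoted in the slides; the function name is made up.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Confidence interval for r via Fisher's z transformation (assumed method)."""
    z = atanh(r)                      # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return tanh(z - crit * se), tanh(z + crit * se)  # back-transform the limits


low, high = correlation_ci(0.62, 20)
print(round(low, 2), round(high, 2))
```

Note that the interval is not symmetric around r, because the back-transformation compresses values near +1 and -1.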
Misuse of correlation



There may be a third variable to which both variables are related.
Correlation does not imply causation.
A nonlinear relationship may exist that r does not capture.
Regression
Method of least squares



The regression line is obtained using the
method of least squares. Any line y = a + bx
that we draw through the points gives a
predicted or fitted value of y for each value of
x in the dataset.
For a particular value of x the vertical
difference between the observed and the fitted
value of y is known as the deviation or
residual.
The method of least squares finds the values of a and b that minimize the sum of the squared deviations.
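For a straight line y = a + bx the least squares solution has a closed form, sketched below with made-up data. The function name `least_squares` is hypothetical, and the x and y values are not the age and urea data from the slides.

```python
from statistics import mean


def least_squares(x, y):
    """Return intercept a, slope b and residuals of the least squares line."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx                       # the line passes through (mean x, mean y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # observed minus fitted
    return a, b, residuals


x = [1, 2, 3, 4, 5]  # hypothetical predictor values
y = [2, 4, 5, 4, 5]  # hypothetical responses
a, b, residuals = least_squares(x, y)
sum_sq = sum(e ** 2 for e in residuals)   # the quantity that a and b minimize
print(round(a, 2), round(b, 2), round(sum_sq, 2))  # prints: 2.2 0.6 2.4
```

Any other choice of a and b for these points gives a larger sum of squared residuals.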
Age and urea level
Residuals
Method of least squares