Download TestOfHypothesis - Asia University, Taiwan

Transcript
Probability Distributions and
Test of Hypothesis
Ka-Lok Ng
Dept. of Bioinformatics
Asia University
Normal Distribution
• Distribution of a random variable
• Statistical parameters – μ and σ
Central Limit Theorem
• Consider the following set of measurements for a given
population: 55.20, 18.06, 28.16, 44.14, 61.61, 4.88, 180.29, 399.11,
97.47, 56.89, 271.95, 365.29, 807.80, 9.98, 82.73. The population
mean is 165.571.
• Now consider two samples drawn from this population.
• These two samples could have means very different from
each other and also very different from the true population mean.
• What happens if we consider not just two samples, but all
possible samples of the same size?
• The answer to this question is one of the most fascinating facts in
statistics: the central limit theorem.
• It turns out that if we calculate the mean of each sample, those
mean values tend to follow a normal distribution,
regardless of the original distribution. The mean of this new
distribution of the means is exactly the mean of the original
population, and its variance is the population variance divided by
the sample size n.
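The two claims about the mean and variance of the sample means can be checked directly on the 15 measurements above. The following is a minimal sketch, assuming samples are drawn with replacement and using a hypothetical sample size of n = 3 (the slides do not fix a sample size); it enumerates every possible ordered sample rather than simulating.

```python
from itertools import product
from statistics import fmean, pvariance

# The 15 population measurements listed above.
population = [55.20, 18.06, 28.16, 44.14, 61.61, 4.88, 180.29, 399.11,
              97.47, 56.89, 271.95, 365.29, 807.80, 9.98, 82.73]

pop_mean = fmean(population)
pop_var = pvariance(population)  # population variance (divides by N)

n = 3  # sample size: a hypothetical choice for this sketch

# Every ordered sample of size n drawn with replacement (15**3 = 3375 samples),
# reduced to its sample mean.
sample_means = [fmean(sample) for sample in product(population, repeat=n)]

mean_of_means = fmean(sample_means)
var_of_means = pvariance(sample_means)

print(f"population mean          = {pop_mean:.3f}")
print(f"mean of sample means     = {mean_of_means:.3f}")
print(f"population variance / n  = {pop_var / n:.1f}")
print(f"variance of sample means = {var_of_means:.1f}")
```

With such a small n the distribution of the means is not yet close to normal; the sketch only illustrates the statements about its mean and variance.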
Central Limit Theorem
When sampling from a population with mean μ and variance σ², the
distribution of the sample mean (the sampling distribution of X̄) will have the
following properties:
• The distribution of X̄ will be approximately normal. The larger the sample,
the more closely the sampling distribution resembles the normal distribution.
• The mean of the distribution of X̄ will be equal to μ, the mean of the
population from which the samples were drawn.
• The variance of the distribution of X̄ will be equal to σ²/n, the variance of the
original population divided by the sample size. Its square root, σ/√n, is called
the standard error of the mean.
http://cnx.org/content/m11131/latest/
http://www.riskglossary.com/link/central_limit_theorem.htm
http://www.indiana.edu/~jkkteach/P553/goals.html
Statistical hypothesis testing
• The expression level of a gene in a given condition is measured
several times, and the mean x of these measurements is calculated.
From many previous experiments, it is known that the mean
expression level of the given gene in normal conditions is μ. How
can you decide which genes are significantly regulated in a
microarray experiment? For instance, one can apply an arbitrary
cutoff such as a threshold of at least twofold up- or down-regulation.
One can formulate the following hypotheses:
1. The gene is up-regulated in the condition under study: x > μ
2. The gene is down-regulated in the condition under study: x < μ
3. The gene is unchanged in the condition under study: x = μ
4. Something has gone awry during the lab experiments and the
gene's measurements are completely off; the mean of the
measurements may be higher or lower than normal: x ≠ μ.
Statistical hypothesis testing
When a hypothesis test is viewed as a decision procedure,
two types of error are possible, depending on which
hypothesis, H0 or H1, is actually true. If a test rejects H0
(and accepts H1) when H0 is true, it is called a type I error.
If a test fails to reject H0 when H1 is true, it is called a
type II error. The following shows the results of the
different decisions.
H0     | Decision: do not reject H0 | Decision: reject H0
True   | Correct decision           | Type I error
False  | Type II error              | Correct decision
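The meaning of a type I error can be illustrated with a small simulation. This is only a sketch under assumed values: the mean 100, standard deviation 10, sample size 4 and the two-tail Z test are hypothetical choices made for the illustration. Data are generated with H0 true, so every rejection is a type I error, and the rejection rate should come out close to the chosen level of 0.05.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable

mu0, sigma, n = 100.0, 10.0, 4   # hypothetical values; H0: mean = mu0
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tail critical value, about 1.96

trials = 20000
false_rejections = 0
for _ in range(trials):
    # Draw a sample from a population for which H0 is true.
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    if abs(z) > z_crit:
        false_rejections += 1  # rejecting a true H0 is a type I error

type1_rate = false_rejections / trials
print(f"estimated type I error rate = {type1_rate:.3f}")
```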
Statistical hypothesis testing
• The next step is to generate two hypotheses. The two hypotheses must be
mutually exclusive and all-inclusive.
• Mutually exclusive means that the two hypotheses cannot both be true at the same time.
• All-inclusive means that their union has to cover all possibilities.
• Expression ratios are converted into probability values to test the hypothesis
that particular genes are significantly regulated.
• The null hypothesis H0 states that there is no difference in signal intensity across the
conditions being tested.
• The other hypothesis (called the alternative or research hypothesis) is named H1. If
we believe that the gene is up-regulated, the research hypothesis will be
H1: x > μ. The null hypothesis has to be mutually exclusive with H1 and also has to
include all other possibilities; therefore, the null hypothesis will be H0: x ≤ μ.
• One assigns a p-value for testing the hypothesis. The p-value is the probability,
assuming the null hypothesis is true, of obtaining just by chance a measurement
at least as extreme as the one observed.
• The probability of rejecting the null hypothesis when it is true is the
significance level α, which is typically set at 0.05; in other words, we
accept that in 1 out of 20 cases in which H0 is true we will wrongly reject it.
Statistical hypothesis testing
One-tail testing
• The alternative hypothesis specifies that the parameter is
greater than the value specified under H0, e.g. H1: μ > 15.
Such a test is called an upper one-tail test.
Example
• The expression level of a gene is measured 4 times in a
given condition. The 4 measurements are used to
calculate a mean expression level of x = 90. It is known from
the literature that the mean expression level of the given
gene, measured with the same technology in normal
conditions, is μ = 100 and the standard deviation is σ = 10.
We expect the gene to be down-regulated in the condition
under study and we would like to test whether the data
support this assumption.
• The alternative hypothesis H1 is "the gene is down-regulated",
H1: x < μ; therefore, the null hypothesis is H0: x ≥ μ
• This is an example of a one-tail (left-tail) hypothesis test, in
which we expect the values to be in one particular tail of
the distribution.
Statistical hypothesis testing
• From the central limit theorem, the means of samples are
distributed approximately as a normal distribution.
• Sample size n = 4, mean x = 90, μ = 100
• Standard deviation σ = 10
• Assuming a significance level of 5%
• The null hypothesis is rejected if the computed p-value is
lower than the significance level (0.05)
• We can calculate the value of Z as
$$Z = \frac{x - \mu}{\sigma/\sqrt{n}} = \frac{90 - 100}{10/\sqrt{4}} = -2$$
The probability of having such a value just by chance, i.e. the p-value, is:
P(Z < -2) = 0.02275
The computed p-value is lower than our significance threshold (0.02275 < 0.05),
therefore we reject the null hypothesis. In other words, we accept the alternative
hypothesis. We state that "the gene is down-regulated at the 5% significance
level".
A knowledgeable reader will understand this as a conclusion reached by a procedure
that wrongly rejects a true null hypothesis in at most 5% of cases.
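The left-tail calculation above can be reproduced with Python's standard library. This is a minimal sketch; the variable names are chosen here for illustration, and `statistics.NormalDist` plays the role of the normal distribution table.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu, sigma, n = 90.0, 100.0, 10.0, 4  # values from the example
alpha = 0.05                                # significance level

# Z score of the sample mean under H0.
z = (x_bar - mu) / (sigma / sqrt(n))

# Left-tail test: the p-value is the area under the standard normal
# curve to the left of z, i.e. P(Z < z).
p_value = NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"Z = {z:.2f}, p-value = {p_value:.5f}, reject H0: {reject_h0}")
```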
[Figure: normal distribution table. Excel NORMDIST gives the area under the curve starting from the left-hand side; Z = 0 and Z = 2 are marked.]
Statistical hypothesis testing
Two-tail testing
• A novel gene has just been discovered. A
large number of expression experiments
measured the mean expression level of
this gene as 100 with a standard deviation
of 10. Subsequently, the same gene is
measured 4 times, in 4 cancer patients.
The mean of these 4 measurements is
109. Can we conclude that this gene is
differentially expressed in cancer?
• We do not know whether the gene will be up-regulated or down-regulated.
• Null hypothesis H0: X = 100
• Alternative hypothesis H1: X ≠ 100
• At a significance level of 5% → 2.5% for the
left tail and 2.5% for the right tail
• Z = (109 - 100)/(10/√4) = 9/5 = 1.8
• Upper-tail probability P(Z ≥ 1.8) = 1 - P(Z ≤ 1.8) = 1 -
0.9641 = 0.0359 > 0.025 → the tail probability is higher than the 2.5% allowed
in each tail (equivalently, the two-tail p-value 2 × 0.0359 = 0.0718 exceeds 0.05),
so we cannot reject the null hypothesis
[Figure: normal curve of X with 2.5% rejection regions in the left and right tails.]
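The two-tail example can be checked in the same way. The sketch below is an illustration with names chosen here; it computes both the upper-tail area that is compared with 2.5% and the equivalent two-tail p-value that is compared with 5%.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu, sigma, n = 109.0, 100.0, 10.0, 4  # values from the example
alpha = 0.05

z = (x_bar - mu) / (sigma / sqrt(n))

# Area in the upper tail beyond z, compared against alpha / 2 ...
upper_tail = 1 - NormalDist().cdf(abs(z))
# ... or, equivalently, the two-tail p-value compared against alpha.
p_two_tail = 2 * upper_tail

reject_h0 = p_two_tail < alpha
print(f"Z = {z:.2f}, upper tail = {upper_tail:.4f}, "
      f"two-tail p = {p_two_tail:.4f}, reject H0: {reject_h0}")
```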
Tests involving the mean – the t distribution
• Hypothesis testing
• Parametric testing – where the data are known or
assumed to follow a certain probability distribution (e.g.
normal distribution)
• Non-parametric testing – where no a priori knowledge is
available and no such assumptions are made.
• The t distribution test, or Student's t-test, is a
parametric test; it was devised by William S. Gosset,
a 32-year-old research chemist employed by the famous
Irish brewery Guinness.
Tests involving the mean – the t distribution
• Tests involving a single sample may focus on the mean
of the sample (t-test, where the variance of the population is
not known) or on the variance (χ²-test). The following
hypotheses may be formulated if the test concerns the
mean of the sample:
1. H0: μ = c, H1: μ ≠ c
2. H0: μ ≥ c, H1: μ < c
3. H0: μ ≤ c, H1: μ > c
• The first pair of hypotheses corresponds to a two-tail test, in
which no a priori knowledge is available, while the
second and the third correspond to one-tail tests in
which the value c is expected to be higher
and lower than the population mean μ, respectively.
Tests involving the mean – the t distribution
• The expression level of a gene is known to have a mean expression level of
18 in the normal human population. The following expression values have
been obtained in five measurements: 21, 18, 23, 20, 18. Are these data
consistent with the published mean of 18 at a 5% significance level?
• The population s.d. σ is not known → t-test; calculate the sample s.d. s to estimate σ
• H0: x = μ = 18, H1: x ≠ μ = 18 → two-tail test
• Calculate the t-test statistic
$$t = \frac{x - \mu}{s/\sqrt{n}} = \frac{20 - 18}{2.12/\sqrt{5}} = 2.11$$
Remember to use n - 1 when calculating the sample standard deviation s.
Tests involving the mean – the t distribution
[Figure: the t-distribution is symmetric.]
Degrees of freedom: ν = n - 1 = 5 - 1 = 4. Using a table of the t-distribution with four degrees of
freedom, the two-tail p-value associated with this test statistic is found to be just above 0.1
(t = 2.11 is slightly below 2.132, the two-tail critical value for α = 0.10).
The 5% two-tail test corresponds to a critical value of 2.776. Since the p-value is greater than 0.05
(t-value = 2.11 < critical value = 2.776), the evidence is not
strong enough to reject the null hypothesis of mean 18 → do not reject H0.
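The t statistic and its tail probabilities can be recomputed from the five measurements. The sketch below uses only the standard library; because it has no t-distribution function, the helpers `t_pdf` and `t_upper_tail` (names introduced here, not from the slides) integrate the t density numerically with Simpson's rule. The critical value 2.776 is the table value quoted above.

```python
import math
from statistics import fmean, stdev

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

data = [21, 18, 23, 20, 18]  # the five measurements
mu0 = 18                     # published mean
n = len(data)
df = n - 1

x_bar = fmean(data)
s = stdev(data)              # sample s.d., divides by n - 1
t_stat = (x_bar - mu0) / (s / math.sqrt(n))

p_one_tail = t_upper_tail(abs(t_stat), df)
p_two_tail = 2 * p_one_tail

t_crit = 2.776               # two-tail 5% critical value for 4 d.o.f. (from the table)
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, one-tail p = {p_one_tail:.4f}, "
      f"two-tail p = {p_two_tail:.4f}, reject H0: {reject_h0}")
```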
[Figure: the t-distribution table, giving cumulative probability starting from the left-hand side; two-tail α = 0.10, 0.05. In Excel, TINV gives the two-tail critical value.]
Tests involving the mean – the t distribution
The expression level of a gene is known to have a mean expression
level of 225 in the normal human population. Expression
values have been obtained in sixteen measurements, for which the
sample mean and s.d. are found to be 241.5 and 98.7259
respectively. Is the mean of these data higher than the published mean at a 5%
significance level?
• This is a right-hand one-tail test
• Null hypothesis H0: x ≤ μ = 225
• Alternative hypothesis H1: x > μ = 225
• t-score = (241.5 - 225)/[98.7259/sqrt(16)] = 0.6685
• Degrees of freedom = 15
• The 5% level corresponds to a critical value (t0.05(15)) of 1.753
• The t-score is less than the critical value, i.e. 0.6685 < 1.753.
• Based on the critical value, we cannot reject the null hypothesis.
• The gene expression data are not significantly higher than the published
mean of 225 at a 5% significance level
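When only summary statistics are available, the same test reduces to one line of arithmetic. A minimal sketch follows, with a hypothetical helper name `one_sample_t` and the critical value 1.753 taken from the table value quoted above.

```python
from math import sqrt

def one_sample_t(x_bar, s, n, mu0):
    """t score of a sample mean against a reference mean mu0."""
    return (x_bar - mu0) / (s / sqrt(n))

x_bar, s, n = 241.5, 98.7259, 16  # sample mean, sample s.d., sample size
mu0 = 225.0                       # published mean
t_crit = 1.753                    # one-tail 5% critical value for 15 d.o.f. (from the table)

t_stat = one_sample_t(x_bar, s, n, mu0)
df = n - 1

# Right-hand one-tail test: reject H0 only if t exceeds the critical value.
reject_h0 = t_stat > t_crit
print(f"t = {t_stat:.4f}, d.o.f. = {df}, reject H0: {reject_h0}")
```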
Tests involving the variance – the chi-square distribution
The expression level of a gene is known to have a variance σ² = 5000 in the normal human
population. The same gene is measured 26 times and found to have a sample variance s² = 9200. Is
there evidence that the new measurements differ from the population at a 2%
significance level?
• Unknown population mean → χ² test
• Null hypothesis H0: s² = σ² = 5000, that is, the newly measured variance is not different
from the population variance σ²
• Alternative hypothesis H1: s² ≠ σ² = 5000 (two-tail test)
• The new test statistic is
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
• This variable has the interesting property that if all possible samples of size n are drawn from a
normal population with variance σ², and the quantity is computed for each such sample,
these values will always form the same distribution. This sampling
distribution is called a χ² (chi-square) distribution with n - 1 degrees of freedom.
[Figure: χ² distribution for the two-tail test, with p = 0.99 and p = 0.01 marked; H0 is rejected in both tails and accepted in between.]
Tests involving the variance – the chi-square distribution
• If the sample s.d. s is close to the population s.d. σ, the value of χ² will be close to n - 1
(the degrees of freedom)
• If the sample s.d. s is very different from the population s.d. σ, the value of χ² will be very
different from n - 1
• Let us use the χ² distribution to solve the above problem.
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(26-1) \times 9200}{5000} = 46$$
http://commons.bcit.ca/math/faculty/david_sabo/apples/math2441/section8/onevariance/chisqtable/chisqtable.htm
• The critical values are χ²0.01(25) = 44.314 and χ²0.99(25) = 11.524 (right-hand tail areas)
• The rejection regions are χ² ≤ 11.524 or χ² ≥ 44.314
• Since 46 > 44.314 → reject the null hypothesis
• The measured variance is different from the population variance at a 2% significance level
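The two-tail decision rule can be written out as a short check. This sketch only evaluates the statistic and compares it with the two table values quoted above; the variable names are chosen here for illustration.

```python
n = 26            # number of measurements
s2 = 9200.0       # sample variance
sigma2 = 5000.0   # population variance under H0

chi2 = (n - 1) * s2 / sigma2
df = n - 1

# Critical values for a 2% two-tail test with 25 d.o.f., from the table.
lower_crit, upper_crit = 11.524, 44.314

# Reject H0 when the statistic falls in either tail.
reject_h0 = chi2 <= lower_crit or chi2 >= upper_crit
print(f"chi-square = {chi2:.2f}, d.o.f. = {df}, reject H0: {reject_h0}")
```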
[Figure: the chi-square distribution table. Excel CHIINV uses the right-hand tail.]
Tests involving the variance – the chi-square distribution
The expression level of a gene is known to follow a normal distribution and to have
a standard deviation (s.d.) of no more than 5 in the normal human
population. The same gene is measured 9 times and found to have an s.d. of
7. Does this data set have a sample variance higher than the published
variance at a 5% significance level?
• This is a right-hand one-tail test
• Null hypothesis H0: σ² ≤ 25
• Alternative hypothesis H1: σ² > 25
• χ² = (9 - 1) × 49/25 = 15.68
• Degrees of freedom = 8
• The 5% level corresponds to a critical value of 15.507
• The χ² value is larger than the critical value 15.507
• Based on the critical value, we can reject the null hypothesis.
• The gene does have an s.d. higher than the published value of 5 at a 5%
significance level.
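For an even number of degrees of freedom the right-hand tail area of the χ² distribution has a closed form, so this example can also be checked through its p-value. The sketch below assumes that closed form, P(χ² > x) = e^(-x/2) Σ (x/2)^k / k! for k = 0 … df/2 - 1; the helper name `chi2_upper_tail_even_df` is introduced here for illustration.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for an even number of degrees of freedom df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

n = 9             # number of measurements
s2 = 7.0 ** 2     # sample variance (s.d. of 7)
sigma2 = 5.0 ** 2 # largest variance allowed under H0 (s.d. of 5)
alpha = 0.05

chi2 = (n - 1) * s2 / sigma2
df = n - 1

p_value = chi2_upper_tail_even_df(chi2, df)  # right-hand tail area
reject_h0 = p_value < alpha
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

As a cross-check, the same function evaluated at the table's critical value 15.507 gives a tail area very close to 0.05.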
Tests involving two samples – comparing means
The expression levels of the gene AC002378 measured in patients (P1 to P6)
and controls (C1 to C6) are given below:
geneID   | P1   | P2   | P3    | P4   | P5   | P6
AC002378 | 0.66 | 0.51 | 1.12  | 0.83 | 0.91 | 0.50
geneID   | C1   | C2   | C3    | C4   | C5   | C6
AC002378 | 0.41 | 0.57 | -0.17 | 0.50 | 0.22 | 0.71
• H0: μP = μC, H1: μP ≠ μC
• Mean of gene expression level of patients, XP = 0.755
• Mean of gene expression level of controls, XC = 0.373
• sP² = 0.059, sC² = 0.097
• To test whether the two samples have the same variance or not, we perform
the F-test at a 5% level
• F = 0.059/0.097 = 0.61, d.o.f. = (5, 5)
• F0.025(5,5) = 7.146, F0.975(5,5) = 0.1399
• F lies between 0.1399 and 7.146 → we cannot reject the null hypothesis of equal
variances → the patients and controls are treated as having the same variance
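The summary statistics that feed the F-test can be recomputed from the raw values. This is a minimal sketch with variable names chosen here; it stops at the F ratio and its degrees of freedom, leaving the comparison with tabulated critical values to the reader.

```python
from statistics import fmean, variance

patients = [0.66, 0.51, 1.12, 0.83, 0.91, 0.50]   # P1..P6
controls = [0.41, 0.57, -0.17, 0.50, 0.22, 0.71]  # C1..C6

mean_p, mean_c = fmean(patients), fmean(controls)

# Sample variances (statistics.variance divides by n - 1).
var_p, var_c = variance(patients), variance(controls)

# F ratio of the two sample variances and its degrees of freedom.
f_ratio = var_p / var_c
df_p, df_c = len(patients) - 1, len(controls) - 1

print(f"mean P = {mean_p:.3f}, mean C = {mean_c:.3f}")
print(f"var P = {var_p:.3f}, var C = {var_c:.3f}")
print(f"F = {f_ratio:.3f} with ({df_p}, {df_c}) degrees of freedom")
```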
Tests involving two samples – comparing means
• t-statistic of two independent samples with equal variances
• The t-score is
$$t = \frac{(X_P - X_C) - (\mu_P - \mu_C)}{\sqrt{s^2_{pool}\left(\frac{1}{n_P} + \frac{1}{n_C}\right)}} = \frac{(0.755 - 0.373) - 0}{\sqrt{0.078\left(\frac{1}{6} + \frac{1}{6}\right)}} = 2.359$$
• where
$$s^2_{pool} = \frac{(n_P - 1)s_P^2 + (n_C - 1)s_C^2}{n_P + n_C - 2} = \frac{(6 - 1) \times 0.059 + (6 - 1) \times 0.097}{6 + 6 - 2} = 0.078$$
• The p-value, i.e. the probability of obtaining such a value by chance, is
0.0400. This value is smaller than the significance level 0.05, and
therefore we reject the null hypothesis: the gene AC002378 is
expressed differently between cancer patients and healthy subjects.
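The pooled-variance t-test can be reproduced end to end from the raw measurements. The sketch below again uses only the standard library, so the two-tail p-value is obtained by numerically integrating the t density with Simpson's rule; `t_pdf` and `t_upper_tail` are helper names introduced here, not part of the slides.

```python
import math
from statistics import fmean, variance

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

patients = [0.66, 0.51, 1.12, 0.83, 0.91, 0.50]
controls = [0.41, 0.57, -0.17, 0.50, 0.22, 0.71]
n_p, n_c = len(patients), len(controls)
df = n_p + n_c - 2

# Pooled variance: average of the sample variances weighted by their d.o.f.
s2_pool = ((n_p - 1) * variance(patients) + (n_c - 1) * variance(controls)) / df

# Under H0 the difference of the population means is 0.
t_stat = (fmean(patients) - fmean(controls)) / math.sqrt(s2_pool * (1 / n_p + 1 / n_c))

p_two_tail = 2 * t_upper_tail(abs(t_stat), df)
reject_h0 = p_two_tail < 0.05
print(f"s2_pool = {s2_pool:.3f}, t = {t_stat:.3f}, "
      f"two-tail p = {p_two_tail:.4f}, reject H0: {reject_h0}")
```

Working from the unrounded variances gives t ≈ 2.359, whereas plugging the rounded values 0.382 and 0.078 into the formula gives a slightly larger value.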