Basic of Probability Theory for Ph.D. students in Education, Social
Sciences and Business
(Shing On LEUNG and Hui Ping WU)
(May 2015)
This is a series of 3 talks respectively on:
A. Probability Theory
B. Hypothesis Testing
C. Bayesian Inference
Lecture 2: Hypothesis Testing
B. Hypothesis Testing
• Probability theory only gives us a basic framework. How can we
proceed to make decisions? Classical vs Bayesian
• Null Hypothesis: H0: score of boys = score of girls (say)

                    Null Hypothesis
                    TRUE            WRONG
Decision  Accept    Correct         Type II error
          Reject    Type I error    Correct (Power)
• Step 1: Construct a test statistic (e.g. the t-value for a t-test)
• Step 2: Get the null distribution of that test statistic (e.g.
t-distribution, or Normal)
• Step 3: Construct a critical region (5%) under H0
• Decision rule: Under H0, create a critical region (usually 5%). If the
data (summarized by the test statistic) fall into this region, we reject H0
and conclude that the alternative (non-H0) is true.
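The three steps and the decision rule can be sketched in Python as follows. This is only an illustration: the scores are made-up numbers, the function name two_sample_z is hypothetical, and the null distribution is taken to be a large-sample Normal approximation (for samples this small a t-distribution would be more appropriate).

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z(group_a, group_b, alpha=0.05):
    """Large-sample test of H0: mean(group_a) = mean(group_b)."""
    # Step 1: test statistic = difference in means / its standard error
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    z = (mean(group_a) - mean(group_b)) / se
    # Step 2: null distribution, approximated here by the standard Normal
    null = NormalDist(0.0, 1.0)
    # Step 3: two-sided critical region holding alpha (5%) of the null probability
    critical = null.inv_cdf(1 - alpha / 2)
    # Decision rule: reject H0 if the statistic falls in the critical region
    return z, critical, abs(z) > critical

# Hypothetical scores, for illustration only
boys = [62, 70, 68, 75, 59, 66, 71, 64]
girls = [72, 78, 69, 80, 74, 77, 70, 76]
z, critical, reject = two_sample_z(boys, girls)
print(f"z = {z:.2f}, critical value = {critical:.2f}, reject H0: {reject}")
```

The function returns the statistic and the critical value as well as the decision, so the reader can see how far inside or outside the critical region the data fall.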
Hypothesis Testing and Courtroom trial
• Null hypothesis = hypothesis of innocence: assume the defendant is
innocent first.
• The defendant is convicted only if there is enough incriminating evidence.
• The hypothesis of innocence is rejected only when an error is very
unlikely (i.e. 5% or even less).
• But the error of the second kind (acquitting a person who committed
the crime, i.e. accepting a wrong hypothesis) can be quite large.
Please refer to the following link for details
http://en.wikipedia.org/wiki/Statistical_hypothesis_testing
The p-value
• Before the computer age, we judged whether the observation fell into
the critical region by matching it against a pre-computed table.
So the result was only > or < 5%, > or < 1%, > or < 0.1%, etc.
• Now, with computers, we can compute the observed significance level,
the p-value (or p)
• If p < 0.05, reject H0 (as data this extreme would be rare under H0)
• But, if p > 0.05, we cannot say we accept H0, because …
• If p = 0.06, what would you say? Marginally significant!
• Some regard p between 0.05 and 0.1 as marginally significant, but there
is no consensus.
• Better to look at the exact value of p; be statistically sensitive
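To see why the exact p-value says more than a yes/no comparison with 5%, the sketch below computes two-sided p-values for a few values of a test statistic. It assumes a standard Normal null distribution; the function name two_sided_p and the chosen z-values are illustrative assumptions.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value of statistic z under a standard Normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Nearby statistics give nearby p-values, even when they land on
# opposite sides of the 0.05 cut-off.
for z in (1.88, 1.96, 2.05, 3.29):
    print(f"z = {z:.2f}  ->  p = {two_sided_p(z):.4f}")
```

A statistic of 1.88 and one of 2.05 carry almost the same evidence, yet a strict 5% rule treats them as opposite outcomes.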
A good statistician should not think in binary terms (extreme opposites),
i.e. black vs white, good vs bad, etc., but in terms of degree (otherwise,
you are not good!)
Statistics provides analysis (or indicators) to support decisions; it does
not make the decisions.
Types of Hypothesis
• There can be many (infinitely many) hypotheses
• We just highlight those commonly encountered in Education,
Business and Social Sciences
Study of relation / differences
H0: No relation / no differences
H1: otherwise, i.e. relation or differences
• t-test, ANOVA, correlation r = 0, etc.
• Please refer to other sources
• These are well-established tests with (i) test statistics, (ii) null
distributions and, of course, (iii) critical regions
• The above are only popular, simple examples in Hypothesis Testing
• In other cases, these three pieces are not necessarily available
Review: the procedure in Hypothesis Testing “generally”
Step 1: Construct a test statistic (can be difficult)
Step 2: Get the null distribution of that test statistic (most difficult)
Step 3: Decide, by constructing a critical region (5%)
t-test (complications behind the t-test) (please refer to other sources)
• “t-test” is correct, not “T-test”
• Step 1: Construct the statistic, e.g. for one sample,
t = (x̄ − µ0) / (s / √N)
• Step 2: (i) variance known (Normal distribution) (not realistic),
(ii) variance unknown (t-distribution) (realistic), with pdf
f(x) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) × (1 + x²/ν)^(−(ν+1)/2), ν = degrees of freedom
• Of course, when N is large (say N > 30), t can be approximated by
Normal, but not other distributions
• Step 3: Decision, easy
• But computers do it all for you
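The remark that t is close to Normal only when N is large can be checked by simulation. The sketch below assumes data drawn from a standard Normal, so that H0: µ = 0 is true, and estimates how often |t| exceeds the Normal 5% cut-off of 1.96. The function names, the number of repetitions and the seed are illustrative choices. If the Normal approximation were adequate, the rate would be close to 5%.

```python
import random
from math import sqrt

def one_sample_t(sample, mu0=0.0):
    """t = (sample mean - mu0) / (s / sqrt(N)), with s the sample SD."""
    n = len(sample)
    m = sum(sample) / n
    s = sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))
    return (m - mu0) / (s / sqrt(n))

def rejection_rate(n, cutoff=1.96, reps=5000, seed=1):
    """Share of simulated |t| values beyond the Normal 5% cut-off under H0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(one_sample_t(sample)) > cutoff:
            hits += 1
    return hits / reps

for n in (5, 30, 100):
    print(f"N = {n:3d}: rate of |t| > 1.96 is about {rejection_rate(n):.3f}")
```

For small N the rate is well above 5%, which is the price of using the Normal table when σ has to be estimated.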
Even for Hypothesis Testing with a simple t-test (which is the
simplest), there are some complications behind the scenes. Other tests
are much more complicated.
Nuisance parameters (a nuisance, but can be important)
• In the t-test, we are interested in µ, but σ is unknown. σ makes things
complicated, and is called a nuisance parameter
• There can be many nuisance parameters. For example, in EFA or
CFA, we want to confirm, say, 3 factors with 30 variables. The number
of nuisance parameters is at least 90 (= 30 × 3) and 30 (= 10 × 3) for
factor loadings in EFA and CFA respectively! We haven’t yet
counted the variances! We are not interested in the particular values of
the parameters, but just want to confirm the factors.
• The number of parameters matters very much in complex modeling, say
EFA, CFA, SEM, etc. Hence, the likelihood ratio test (LRT), parametric
bootstrapping and Bayesian analysis are used.
Hypothesis Testing for model fitting
H0: A specific model fits
H1: otherwise
How to get (i) a test statistic, and (ii) its null distribution?
Likelihood and Likelihood Ratio Test (LRT)
Likelihood
L(θ|x) = Pr(x|θ)
• Usually, θ is not a single value (scalar), but many values (vector)
• Pr(x|θ) is the chance of the outcome given the parameters. For example,
given that a coin is fair (θ = 0.5) and we flip it 10 times, what is
the chance of getting at least 5 Heads, Pr(x ≥ 5 | θ = 0.5), etc.
• Likelihood (not probability) is a function of the parameters (θ) given our
data (x). If we flip a coin 10 times and observe 6 Heads, what is
the most likely value of θ (θ = chance of getting a Head)? This is
not a probability.
• The two are mathematically (or numerically) the same, but their roles are
different
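The coin example can be made concrete with a short sketch. The same binomial formula is used twice: once as a probability (θ fixed, outcome varies) and once as a likelihood (data fixed, θ varies). The function name binom_pmf and the θ values tried are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Pr(k Heads in n flips | theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Probability: theta is fixed (fair coin), the outcome x varies.
p_at_least_5 = sum(binom_pmf(k, 10, 0.5) for k in range(5, 11))
print(f"Pr(x >= 5 | theta = 0.5) = {p_at_least_5:.4f}")

# Likelihood: the data are fixed (6 Heads in 10 flips), theta varies.
for theta in (0.3, 0.5, 0.6, 0.8):
    print(f"L(theta = {theta} | x = 6) = {binom_pmf(6, 10, theta):.4f}")

# Summed over outcomes x, the probabilities add to 1. The likelihood values
# over theta carry no such constraint, which is why L is not a probability
# distribution for theta.
total_over_x = sum(binom_pmf(k, 10, 0.5) for k in range(11))
print(f"Sum over x of Pr(x | theta = 0.5) = {total_over_x:.4f}")
```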
Maximum Likelihood Estimation (MLE or mle)
• This refers to finding the most likely value of the parameters θ,
given data x
• This is done by maximizing L with respect to θ: which values of
θ give the maximum value of L?
• It is one (popular) method of parameter estimation
• It is a parameter estimation method, not a test
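As a minimal sketch of maximizing L with respect to θ, the code below searches a grid of θ values for the coin data used earlier (6 Heads in 10 flips). The grid search and the function name likelihood are illustrative assumptions; real software typically uses numerical optimization or a closed-form solution instead.

```python
from math import comb

def likelihood(theta, heads=6, flips=10):
    """Binomial likelihood of theta for the observed number of Heads."""
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

# Try theta = 0.001, 0.002, ..., 0.999 and keep the value with the largest L
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=likelihood)

print(f"MLE by grid search: {theta_hat:.3f}")
print(f"Sample proportion:  {6 / 10:.3f}")
```

For this simple model the grid search lands on the sample proportion of Heads, which is the known closed-form MLE for a binomial proportion.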
Likelihood Ratio Test (LRT) (It is a test)
H0: θ = θ0, Model 1 fits
H1: θ = θ1, Model 2 fits (say)
• The LRT compares H0 vs H1.
• LR (likelihood ratio) = L(θ0|x) / L(θ1|x)
• Usually we take the logarithm (ratio -> difference), giving the
log-likelihood ratio statistic
• Usually, θ0 fixes some parameter values, say at 0 (e.g. the correlation
between two variables is zero, or a factor loading equals zero in CFA,
etc.)
• And θ1 takes the most likely values when the parameters are not fixed
(e.g. the MLE)
• So, usually, the null is less likely than the alternative, and the ratio is
smaller than 1.
• If the ratio is too small, the null is much less likely than the alternative,
and we reject the null. “Too small” = within the 5% critical region of the
null distribution.
• Step 1: the test statistic is constructed. But Step 2, the distribution?
• Usually, it is not known and quite complicated
• If N is large (plus other “regularity conditions”), −2 log(LR) ~ χ²(df),
where df = degrees of freedom = difference in the number of parameters
between null and alternative
• A big “if”
http://en.wikipedia.org/wiki/Likelihood-ratio_test
• Other than the LRT, there are other tests, but they are usually complicated
and also assume N is large, etc.
• Other ways: (i) parametric bootstrapping, (ii) Bayesian approach
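The large-N result can be tried out on a coin example. The sketch below assumes H0: θ = 0.5 against an unrestricted θ, so one parameter is fixed under the null and df = 1. The function names are illustrative. For df = 1 the χ² upper-tail probability equals erfc(√(stat/2)), which keeps the example within the standard library. With N as small as 10 the "big if" clearly applies, so that p-value is only a rough guide.

```python
from math import comb, erfc, log, sqrt

def likelihood(theta, heads, flips):
    """Binomial likelihood of theta for the observed number of Heads."""
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

def lrt_coin(heads, flips, theta0=0.5):
    """LRT of H0: theta = theta0 against an unrestricted theta."""
    theta_hat = heads / flips                     # MLE under the alternative
    lr = likelihood(theta0, heads, flips) / likelihood(theta_hat, heads, flips)
    stat = max(0.0, -2 * log(lr))                 # -2 log(LR), never negative
    # df = 1, so the chi-square upper tail is erfc(sqrt(stat / 2))
    p_value = erfc(sqrt(stat / 2))
    return lr, stat, p_value

for heads, flips in ((6, 10), (60, 100)):
    lr, stat, p = lrt_coin(heads, flips)
    print(f"{heads}/{flips}: LR = {lr:.4f}, -2 log(LR) = {stat:.3f}, p = {p:.4f}")
```

The same 60% share of Heads gives a much smaller ratio when N is 100 than when N is 10, since more data discriminate better between the null and the alternative.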
Asymptotic vs exact p-values
• Usually, asymptotic (approximate) p-values are used, usually
assuming Normality
• In some simple classroom problems, e.g. tossing a coin, exact
tests are available, but not for complicated problems
• Parametric bootstrapping provides computer-generated p-values,
which are close to the exact ones (or may be the best a human being can do!)
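For the coin-tossing case, the exact and the asymptotic p-values can be put side by side. The sketch below tests H0: θ = 0.5. The exact two-sided p-value follows one common convention (add up the probabilities of all outcomes no more likely than the observed one), and the asymptotic one uses the Normal approximation without continuity correction. Both function names and the example counts are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_p(heads, flips, theta0=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes
    no more likely than the observed one under H0."""
    pmf = [comb(flips, k) * theta0**k * (1 - theta0)**(flips - k)
           for k in range(flips + 1)]
    observed = pmf[heads]
    # Small tolerance so that ties are not lost to rounding
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

def normal_p(heads, flips, theta0=0.5):
    """Asymptotic two-sided p-value from the Normal approximation."""
    z = (heads - flips * theta0) / sqrt(flips * theta0 * (1 - theta0))
    return 2 * (1 - NormalDist().cdf(abs(z)))

for heads, flips in ((8, 10), (60, 100)):
    print(f"{heads}/{flips}: exact p = {exact_p(heads, flips):.4f}, "
          f"Normal p = {normal_p(heads, flips):.4f}")
```

With only 10 tosses the two p-values differ noticeably; with 100 tosses they are much closer, which is what "asymptotic" promises.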
Parametric bootstrapping
Step 1: Have real data
Step 2: Estimate parameters (e.g. μ in Normal, or factor loadings in
EFA or CFA)
Step 3: Compute a statistic (t-statistic, or fit statistic of EFA, CFA, etc.)
*Steps 4, 5 and 6 are to be repeated many times*
Step 4: (*repeat) Generate a data set from the model fitted in Step 2
Step 5: (*repeat) Estimate parameters for the data from Step 4 (as Step 2
did for Step 1) (most time consuming)
Step 6: (*repeat) Repeat Step 3, but using the data from Step 4 and the
parameters from Step 5
Step 7: Repeat Steps 4 to 6 to form an empirical null distribution
Step 8: Compare the statistic from Step 3 with the null distribution from Step 7
(search for “parametric bootstrap” or otherwise or ask me later)
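The eight steps can be sketched on a deliberately small model. Instead of EFA or CFA, the sketch assumes H0: the data follow an exponential distribution, with a Kolmogorov-Smirnov type distance as the fit statistic. Both choices, all function names, the simulated "real" data and the (count + 1) / (reps + 1) p-value convention are assumptions made so that the example runs with the standard library alone.

```python
import random
from math import exp

def fit_rate(data):
    """Step 2 / Step 5: MLE of the exponential rate."""
    return len(data) / sum(data)

def ks_distance(data, rate):
    """Step 3 / Step 6: largest gap between the empirical CDF and the
    fitted exponential CDF (a Kolmogorov-Smirnov type fit statistic)."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 1 - exp(-rate * x)
        gap = max(gap, i / n - cdf, cdf - (i - 1) / n)
    return gap

def parametric_bootstrap_p(data, reps=2000, seed=1):
    rng = random.Random(seed)
    rate_hat = fit_rate(data)                      # Step 2
    observed = ks_distance(data, rate_hat)         # Step 3
    null_stats = []
    for _ in range(reps):                          # Step 7: repeat Steps 4-6
        fake = [rng.expovariate(rate_hat) for _ in data]   # Step 4
        fake_rate = fit_rate(fake)                         # Step 5
        null_stats.append(ks_distance(fake, fake_rate))    # Step 6
    # Step 8: share of simulated statistics at least as large as observed
    exceed = sum(s >= observed for s in null_stats)
    return (exceed + 1) / (reps + 1)

# Step 1: "real" data (simulated here, purely for illustration)
rng = random.Random(42)
fits_model = [rng.expovariate(0.5) for _ in range(50)]
breaks_model = [rng.uniform(4.0, 6.0) for _ in range(50)]
print("p (exponential data):", parametric_bootstrap_p(fits_model))
print("p (uniform data):    ", parametric_bootstrap_p(breaks_model))
```

Re-estimating the rate inside the loop (Step 5) is what makes the simulated null distribution account for the fact that the parameter was estimated from the same data.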
• Parametric bootstrapping is a computationally intensive method
Common Problem for most Ph.D. students
• If this specific model fits, it doesn't imply that other models don't fit
(common to parametric bootstrapping and Bayesian Inference)
• And fit vs doesn’t fit is, in many cases, a matter of degree (unless the
p-value is > 0.8 or < 0.05)
Classical Pr(X|θ) vs Bayesian Pr(θ|X)
(Next lecture on Bayesian)
Q&A
Shing On LEUNG
Hui Ping WU