Download Business Analytics II

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Foundations of statistics wikipedia , lookup

Transcript
Managerial Economics &
Decision Sciences Department
Developed for
business analytics II
week 1
week 2
week 3
▌statistical models: hypotheses, tests & confidence intervals
hypotheses 
tests 
confidence intervals 
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
learning objectives
► statistics





►
 load, modify and save data
 basic statistical tools and graphics
 perform tests and build confidence intervals
► (MSN)
 Chapter 2
► (CS)
 Height, Marriages and Preferences
null and alternative hypotheses
testing a hypothesis
pvalue
test significance level and test power: type I and type II errors
confidence intervals: construction and interpretation
readings
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
Managerial Economics &
Decision Sciences Department
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
Why Taller Wife Couples Are So Rare
a first look at statistical hypotheses and tests
► The article “Why It's So Rare for a Wife to Be Taller Than Her Husband” asserts that the most common married
couples have the husband five to six inches taller, and a small minority of couples have the wife taller. For 2009,
based on a sample of 4,600 married couples, the distribution of the difference (in inches) between husband’s and
wife’s height is shown in the diagram below.
quiz
Suppose it’s year 2009
and you encounter a pair of two
Americans (male and female). You
estimate the height difference between
the two (male height – female height)
to be negative 2 inches, i.e. the female
is taller than the male.
► Do you think the pair is actually a
husband  wife couple?
12.00%
11.00%
10.00%
9.00%
8.00%
7.00%
6.00%
5.00%
4.00%
3.00%
2.00%
1.00%
0.00%
23 to 24
22 to 23
21 to 22
20 to 21
19 to 20
18 to 19
17 to 18
16 to 17
15 to 16
14 to 15
13 to 14
12 to 13
11 to 12
10 to 11
9 to 10
8 to 9
7 to 8
6 to 7
5 to 6
4 to 5
3 to 4
2 to 3
1 to 2
0 to 1
-1 to 0
-2 to -1
-3 to -2
-4 to -3
-5 to -4
-6 to -5
-7 to -6
-8 to -7
-9 to -8
-10 to -9
-11 to -10
Assume that the sample distribution is the
same for the whole population of all married
couples.
Figure 1. Height difference for married couples
difference between husband’s and wife’s height in inches
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 1
Managerial Economics &
Decision Sciences Department
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
Why Taller Wife Couples Are So Rare
a first look at statistical hypotheses and tests
Answer
Common sense indicates that for married couples it is very rare to see the woman 2 inches taller than the
man (based on the provided distribution the likelihood of this situation is about 0.9218%). According to the this
reasoning, you would probably conclude that it is very unlikely that the pair is a husband  wife couple.
► In fact the likelihood of obtaining
even more (extreme) negative height
differences than the currently observed
height difference is 1.9722%. This
likelihood (called the “pvalue”) gives
you a measure of how far in the “tail” of
the distribution your currently observed
value is.
12.00%
11.00%
10.00%
9.00%
8.00%
7.00%
6.00%
5.00%
4.00%
3.00%
2.00%
1.00%
0.00%
23 to 24
22 to 23
21 to 22
20 to 21
19 to 20
18 to 19
17 to 18
16 to 17
15 to 16
14 to 15
13 to 14
12 to 13
11 to 12
10 to 11
9 to 10
8 to 9
7 to 8
6 to 7
5 to 6
4 to 5
3 to 4
2 to 3
1 to 2
0 to 1
-1 to 0
-2 to -1
-3 to -2
-4 to -3
-5 to -4
-6 to -5
-7 to -6
-8 to -7
-9 to -8
-10 to -9
-11 to -10
► The lower this likelihood is the
farther the observed value is in the tail
of the distribution. An indication that
the observed value is unlikely to “come
from” the considered distribution.
Figure 2. Tail or “extreme” values
more extreme height differences than the observed one
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 2
Managerial Economics &
Decision Sciences Department
session one
statistical models: hypotheses, tests & confidence intervals
Why Taller Wife Couples Are So Rare
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
a first look at statistical hypotheses and tests
► Summary of approach:
hypothesis
 the hypothesis was stated as “the pair is married” (notice that data/information was available only
for married couples) which allows you to reason further…
test
 …“if the hypothesis is true then the observed height difference should not be extremely unlikely
under the provided distribution for married couples”
decision
 you have to set the level at which “extremely unlikely” is indicative that the observed height
difference does not come from the provided distribution for married couples and thus reject the
hypothesis beyond any reasonable doubt.
► Notice how the hypothesis is formulated in a way that allows you to look for “credible” or “beyond any reasonable
doubt” evidence against the hypothesis. If this evidence is found then you can reject the hypothesis.
quiz
If the only information available is the distribution of height difference for married couples, could the
hypothesis “the pair is not married” be tested?
Answer
No. This hypothesis cannot be tested in the absence of information about the height difference for
nonmarried persons since there is no way to find any evidence against the hypothesis “the pair is not married”.
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 3
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Why Taller Wife Couples Are So Rare
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
a first look at statistical hypotheses and tests
key concept: statistical testing components
 a hypothesis providing a statement about a variable with assumed (known) distribution
 a statistical test providing quantitative evidence used to potentially reject the hypothesis
 a rule prescribing how the result of the statistical test is used to accept/reject the stated hypothesis
► Once the hypothesis is stated and the underlying distribution is determined/assumed the following logical
implications are considered:
logical implication flow
hypothesis is true
hypothesis is not true
test value comes from assumed
underlying distributionun
test value does not come from
assumed underlying distribution
likely test value is not farther into the
tail of assumed distribution
(high pvalue)
likely test value is farther into the tail
of assumed distribution
(low pvalue)
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
logical implication flow
Figure 3. Hypothesis testing and pvalue
session one | page 4
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
Why Taller Wife Couples Are So Rare
a first look at statistical hypotheses and tests
quiz
Suppose it’s year 2009 and you encounter a pair of two Americans (male and female). You estimate the
height difference between the two (male height – female height) to be positive 8 inches, i.e. the male is taller than the
female.
► Do you think the pair is actually a
husband  wife couple?
Again, assume that the sample distribution
is the same for the whole population of all
married couples.
Figure 4. Tail or “extreme” values
12.00%
11.00%
10.00%
9.00%
8.00%
7.00%
6.00%
5.00%
4.00%
3.00%
2.00%
1.00%
0.00%
23 to 24
22 to 23
21 to 22
20 to 21
19 to 20
18 to 19
17 to 18
16 to 17
15 to 16
14 to 15
13 to 14
12 to 13
11 to 12
10 to 11
9 to 10
8 to 9
7 to 8
6 to 7
5 to 6
4 to 5
3 to 4
2 to 3
1 to 2
0 to 1
-1 to 0
-2 to -1
-3 to -2
-4 to -3
-5 to -4
-6 to -5
-7 to -6
-8 to -7
-9 to -8
-10 to -9
-11 to -10
Answer
The likelihood to observe
height differences less than 8 inches is
71.1852% and the likelihood to
observe height differences larger than
8 inches is 28.8148%. It does not
appear to have any credible evidence
against the hypothesis that the pair is
a husband  wife couple.
height differences lower than the
observed one
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
height differences higher than the
observed one
session one | page 5
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
Why Taller Wife Couples Are So Rare
a first look at statistical hypotheses and tests
► A somewhat different way to use the distribution is to derive an interval centered in the mean height difference
(approximately 6 inches) such that a proportion 1   of married couple will have the height difference within this
interval.
► Say   10%; we need the lower
bound (lb) and upper bound (ub) such
that the sum of the frequencies (“area
under the curve”) between lb and ub is
exactly 90%. The added constraint is
that the interval is centered in the
mean of the distribution.
► For lb  0 and ub  12 we get the
area under the curve between lb and
ub approximately 90.3334%.
Figure 5. “Confidence interval” for   10%
12.00%
11.00%
10.00%
9.00%
8.00%
7.00%
6.00%
5.00%
area  1 – 
4.00%
3.00%
2.00%
1.00%
0.00%
23 to 24
22 to 23
21 to 22
20 to 21
19 to 20
18 to 19
17 to 18
16 to 17
15 to 16
14 to 15
13 to 14
12 to 13
11 to 12
10 to 11
9 to 10
8 to 9
7 to 8
6 to 7
5 to 6
4 to 5
3 to 4
2 to 3
1 to 2
0 to 1
-1 to 0
-2 to -1
-3 to -2
-4 to -3
-5 to -4
-6 to -5
-7 to -6
-8 to -7
-9 to -8
-10 to -9
-11 to -10
90% of married couples will exhibit height
differences between 0 inches and 12 inches
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 6
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
MBA Demographics
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
► The average age at enrollment time for
Kellogg’s PartTime MBA 2014 Entrants is about
28.8 years. Total number of entrants is in the
range 500600.
► The age distribution for 500 of Kellogg’s
PartTime MBA 2015 Entrants is shown below. For
the moment let’s refrain from calculating directly the
average age and female percentage.
(Details in the file MBADemo.dta)
Figure 6. PartTime MBA 2014 Entrants
Figure 7. Age Distribution PartTime MBA 2015 Entrants
20.00%
18.00%
16.00%
14.00%
12.00%
10.00%
8.00%
6.00%
4.00%
2.00%
0.00%
24.8 25.6 26.5 27.3 28.1 28.9 29.8 30.6 31.4 32.3 33.1 33.9 34.7 35.6 36.4
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 7
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
► A random sample of size 40 was extracted from the population of 500 students. For full disclosure and later use,
the procedure to extract the random sample is described below:
1
shuffle the initial dataset
generate random  runiform()
sort random
2
choose the first 40 observations from the shuffled dataset
sample 40, count
3
visualize the summary statistics
summarize
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
In the first step we add a new column called
“random” in which we have random numbers
between 0 and 1. At this point the dataset in the
memory has the initial 500 observations thus
there will be 500 random numbers added. When
we sort the dataset by column “random” we
basically rearrange (randomly) the observations
in the data set. Finally, we pick the first 40
observations and since the initial observations
we randomly shuffled, we can say that the
sample we obtain is a random sample (i.e., we
did not “cherrypick”.)
session one | page 8
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
Figure 8. PartTime MBA 2015 Entrants – Population (left) and randomly generated sample (right)
Index
1
2
3
Age
27.4
30.6
28.2
Gender
1
1
1
Index
18
432
112
Age
32.2
30.8
28.6
Gender
0
1
0
…
…
…
…
…
…
38
39
40
41
29.3
30.7
29.5
28.6
0
0
0
0
370
174
388
28.5
30
32.8
0
0
0
…
…
…
498
499
500
28.1
27.3
33.4
0
0
0
generate random  runiform()
sort random
sample 40, count
drop random
Figure 9. Sample Summary Statistics
Variable |
Obs
Mean
Std. Dev.
Min
Max
---------+---------------------------------------------------------Age |
40
29.5125
1.824504
25.9
35.1
Gender |
40
.325
.4743416
0
1
sample
size
sample
mean
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
sample
std. deviation
session one | page 9
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
key issue
inference at population level
► The first key problem we want to address is the following: based solely on the available sample, can we infer
anything about the average age at the population level relative to a pre-set benchmark X0?
► We will answer the above question using the framework introduced earlier:
hypothesis
test
decision
 a statement about E[X]
 a value that can be compared to an assumed/implied distribution under the stated hypothesis
 the decision rule evaluates how far is the test into the tail(s) of assumed/implied distribution
Remark In our “controlled environment” we have access to the population dataset from which we extracted the
sample. This is for pedagogical purposes as we will perform the inference on some statistics based solely on the
available sample and then evaluate our analysis by “revealing” the statistics at the population level.
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 10
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
hypothesis
test
decision
key concept hypothesis
► The hypothesis is a statement relating the true mean E[X] with a pre-set benchmark X0. The three
typical hypotheses are:
 “less than”
E[X]  X0
(H: the true mean is less than X0)
 “greater than”
E[X]  X0
(H: the true mean is greater than X0)
 “different than”
E[X]  X0
(H: the true mean is different than X0)
Remark The pre-set benchmark can be any number you wish, however the choice for X0 is a meaningful one.
► Consider the Part-Time MBA 2014 Entrants’ average age as X0 = 28.8 and our hypothesis to be that the average
(mean) age E[X] for Part-Time MBA 2015 Entrants is less than 28.8 years.
 “less than”
E[X]  X0
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 11
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
hypothesis
test
key concept null and alternative hypotheses
► The hypothesis we want to test is called the null hypothesis (H0) and the opposite of the null
hypothesis is called the alternative hypothesis (Ha).
decision
Remark The null and alternative hypotheses together should be MECE (i.e., mutually exclusive and collectively
exhaustive):
H0: E[X]  X0
H0: E[X]  X0
H0: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
► The purpose of the testing step is to evaluate how strong is the evidence against the null hypothesis, i.e. try to find
credible evidence (beyond any reasonable doubt) to reject the null hypothesis.
► This is the only way in which whatever decision we make “reject H0” or “cannot reject H0” we can say that the
decision was made beyond any reasonable doubt.
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 12
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
MBA Demographics
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
hypothesis
test
decision
► A statement relating the true mean E[X] with a pre-set benchmark X0.
key concept sample estimator
► A statistics calculated at the population level (e.g., mean, standard deviation) can be estimated by
the corresponding statistics based on an available sample with n observations extracted from the
population.
1 n
X i is an estimator for the true mean E[X].

n i 1
► The sample average of 29.51 years is the sample-based estimator for mean age of Part-Time MBA 2015 Entrants
(a population of size 500).
Remark A simple example: the sample average X 
quiz
Using the sample-based average X as an estimator for E[X] it seems we find evidence against the
hypothesis that E[X]  X0. However we do not have information about the distribution of age for the Part-Time MBA
2015 Entrants to gauge how far is 29.51 from 28.8. What distribution should we use?
X  X0
Answer
(Not obvious at all.) There is a theoretical result that provides the distribution of the ratio
that
s
X
can be used in most parameter testing cases.
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 13
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
hypothesis
test
decision
► A statement relating the true mean E[X] with a pre-set benchmark X0.
key concept distribution of ttest statistics
► Suppose that we obtain a sample with n observation for a variable X with assumed mean X0.
Then
X  X0
ttest 
~ t  distribution
sX
Remark About notations: X is the sample mean of X and sX is the sample standard error of the mean. For this case,
the t distribution has n – 1 degrees of freedom.
► The t distribution - a few important properties:
Figure 10. Density function for the t distribution
 has a zero mean and is symmetric around its mean, thus its density
is always centered in zero
 has one parameter “degrees of freedom” (df) that affects its shape, in
particular how thick are the tails (affecting the variance only)
 You can generate the density function for the t distribution using the
STATA reserved function tden(df,x), range(a b):
twoway function tden(df,x), range(4 4)
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 14
session one
Managerial Economics &
Decision ►
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
hypothesis
test
decision
► A statement relating the true mean E[X] with a pre-set benchmark X0.
► Provides calculated ttest that, given the hypothesis is true, comes from a t distribution
key concept the pvalue
► The pvalue is the probability that a tdistributed variable would be further into the tail(s) than the
calculated ttest.
Figure 11. Left tail pvalue
1 – ttail(df,ttest)
ttest
pvalue  Pr[ t  ttest ]
area to the left of ttest
Figure 12. Right tail pvalue
Figure 13. Two tail pvalue
ttail(df,ttest)
1 – ttail(df,|ttest|)
ttest
–|ttest|
pvalue  Pr[ t  ttest ]
area to the right of ttest
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
ttail(df,|ttest|)
|ttest|
pvalue  Pr[ t |ttest|]+Pr[ t |ttest|]
area to the left of |ttest| and right of |ttest|
session one | page 15
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
MBA Demographics
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
hypothesis
test
decision
► A statement relating the true mean E[X] with a pre-set benchmark X0.
► Provides calculated ttest that, given the hypothesis is true, comes from a t distribution
key concept significance level
► The significance level is the threshold for pvalue below which you considered pvalue small
enough to indicate that the calculated ttest is too far into the tail of the t distribution and thus evidence
against the stated hypothesis.
Remark The common notation for significance level is the Greek letter (0,1). The significance level is usually
expressed as percentage (e.g., 5%, 10%, etc.)
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 16
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
hypothesis
test
decision
► A statement relating the true mean E[X] with a pre-set benchmark X0.
► Provides the calculated ttest that, given the hypothesis is true, comes from a t distribution
► Evaluates whether the calculated ttest is too far into the tail of the t distribution to represent
evidence against the stated hypothesis.
Figure 14. Hypothesis testing
hypotheses pair
H0: E[X]  X0
Ha: E[X]  X0
H0: E[X]  X0
Ha: E[X]  X0
H0: E[X]  X0
Ha: E[X]  X0
calculated ttest and distribution
calculated p–value
accept/reject decision
ttest 
X  X0
~ t  distribution
sX
p  value  Pr[t  ttest ]
if p–value <  reject H0
otherwise cannot reject H0
at the significance level 
ttest 
X  X0
~ t  distribution
sX
p  value  Pr[t  ttest ]
if p–value <  reject H0
otherwise cannot reject H0
at the significance level 
p  value  Pr[t   | ttest |] 
Pr[t  | ttest |]
if p–value <  reject H0
otherwise cannot reject H0
at the significance level 
X  X0
ttest 
~ t  distribution
sX
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 17
Managerial Economics &
Decision ►
session one
Developed for
statistical models: hypotheses, tests & confidence intervals
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
► Regardless of which of the previous three pairs of null/alternative hypotheses you test, you’ll start with the ttest:
ttest Age  28.8
X0
You have to specify the variable for which you conduct the test (Age) and what is the assumed mean (here
we are using the mean of the population, but you can use any other value if that is the benchmark against
which you test the sample mean.)
Figure 15. Results of ttest command
Variable |
Obs
Mean
Std. Err.
Std. Dev.
[95% Conf. Interval]
---------+-------------------------------------------------------------------Age |
40
29.5125
.2884794
1.824504
28.929
30.096
-----------------------------------------------------------------------------mean  mean(Age)
t 
2.4698
Ho: mean  28.8
degrees of freedom 
39
X  29.51
sX  0.29
calculated ttest
X  X 0 29.51  28.8

 2.47
sX
0.29
► Here’s the ttest calculated “manually”:
ttest 
► Notice that:
Std .Err . 
Std .Dev . 1.824504

 0.29
Obs.
40
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 18
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
► Consider: H0: E[X]  X0
Ha: E[X]  X0
Note that a X significantly lower than X0 (implying ttest  0) might raise some
concerns over H0 being true.
► A very negative ttest means this value is very much into
the left tail of the t distribution. This implies that the area to
the left of ttest is small too.
Figure 16. pvalue for H0: E[X]  X0
► The pvalue is the probability that a t distributed variable
takes even more extreme values than the calculated ttest:
pvalue  Pr[t  ttest]
► Graphically this is the area to the left of calculated ttest.
► The likelihood of rejecting H0 increases as the pvalue is
smaller. This is an indication of ttest being too far into the tail
of a t distribution  this should be an unlikely outcome if
indeed ttest is t distributed.
pvalue is the
area to the left
of ttest
ttest
You can calculate pvalue as 1 – ttail(df,ttest)
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 19
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
► Consider: H0: E[X]  X0
Ha: E[X]  X0
Note that a X significantly higher than X0 (implying ttest  0) might raise some
concerns over H0 being true.
► A very positive ttest means this value is very much into
the right tail of the t distribution. This implies that the area to
the right of ttest is small too.
► The pvalue is the probability that a t distributed variable
takes even more extreme values than the calculated ttest:
pvalue  Pr[t  ttest]
► Graphically this is the area to the right of calculated ttest.
► The likelihood of rejecting H0 increases as the pvalue is
smaller. This is an indication of ttest being too far into the tail
of a t distribution  this should be an unlikely outcome if
indeed ttest is t distributed.
Figure 17. pvalue for H0: E[X]  X0
pvalue is the
area to the
right of ttest
ttest
You can calculate pvalue as ttail(df,ttest)
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 20
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
► Consider: H0: E[X]  X0
Ha: E[X]  X0
Note that a X significantly different than X0 (implying either ttest  0 or ttest  0)
might raise some concerns over H0 being true.
► A very negative or very positive ttest means this value is
very much into the left or the right tail of the t distribution. This
implies that the area to the left of |ttest| plus the are to the
right of |ttest| is small too.
► The pvalue is the probability that a t distributed variable
takes even more extreme values than the calculated ttest:
pvalue  Pr[t   |ttest|]  Pr[t  |ttest|]
► Graphically this is the area to the left of |ttest| plus the
area to right of |ttest|.
► The likelihood of rejecting H0 increases as the pvalue is
smaller. This is an indication of ttest being too far into one of
the tails of a t distribution – this should be an unlikely
outcome if indeed ttest is t distributed.
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
Figure 18. pvalue for H0: E[X]  X0
pvalue is the sum of
area to the left of
–|ttest|
–|ttest|
area to the right of
|ttest|
|ttest|
You can calculate pvalue as 2*ttail(df,abs(ttest))
session one | page 21
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
key concept type I and type II errors
► The analysis so far indicates that, regardless of the choice of null and alternative hypotheses, the smaller the
corresponding pvalue the more evidence we have against the null (i.e., more likely to reject H0).
► But how small should pvalue be in order to be sure we can reject H0? The key point to grasp is here is that you
can never be sure that the null is false. Thus you have to set a threshold () such that you are comfortable rejecting
H0 whenever the calculated p–value is below this threshold.
► What  should you choose?
► The choice of  should be made independently of calculating the pvalue, the significance level is really a
subjective choice made by the analyst. However, in choosing the significance level  keep this in mind:
type I error a large  means it is easier to reject the null H0 (i.e., you will reject H0 for a wider range of values for
the p–value). You are likely to reject H0 when actually H0 is true
type II error a small  means it is harder to reject the null H0 (i.e., you will reject H0 for a narrower range of values
for the p–value). You are likely to accept H0 when actually H0 is false
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 22
session one
Managerial Economics &
Decision ►
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
► Back to the output of our previous ttest:
ttest Age  28.8
X0
Notice that the STATA output provides the pvalue for three alternative hypotheses. It is analyst’s job to
choose the correct alternative Ha. This is not difficult since the choice of the alternative is the opposite of the
null that the analyst has stated from the beginning.
Figure 19. Results of ttest command
Variable |
Obs
Mean
Std. Err.
Std. Dev.
[95% Conf. Interval]
---------+-------------------------------------------------------------------Age |
40
29.5125
.2884794
1.824504
28.929
30.096
-----------------------------------------------------------------------------mean = mean(Age)
t =
2.4698
Ho: mean = 28.8
degrees of freedom =
39
Ha: mean < 28.8
Pr(T < t) = 0.9910
Ha: mean != 28.8
Pr(|T| > |t|) = 0.0180
Ha: mean > 28.8
Pr(T > t) = 0.0090
H0: E[X]  X0
H0: E[X]  X0
H0: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
p–value  0.991
p–value  0.018
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
calculated ttest
p–value  0.009
session one | page 23
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
Figure 20. Graphical interpretation for results of ttest command
mean = mean(Age)
Ho: mean = 28.8
t =
2.4698
degrees of freedom =
Ha: mean < 28.8
Pr(T < t) = 0.9910
Ha: mean != 28.8
Pr(|T| > |t|) = 0.0180
H0: E[X]  X0
H0: E[X]  X0
H0: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
p–value  0.991
p–value  0.018
2.47
– 2.47
You can calculate the corresponding p–value using:
1 – ttail(39,2.47)
2*ttail(39,2.47)
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
39
Ha: mean > 28.8
Pr(T > t) = 0.0090
p–value  0.009
2.47
2.47
ttail(39,2.47)
session one | page 24
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
quiz
Which null hypothesis would you be able to reject?
Answer
The accept/reject decision depends on the chosen significance level . Say   0.05 (5.00%) then
Figure 21. Results of ttest command
mean = mean(Age)
Ho: mean = 28.8
t =
2.4698
degrees of freedom =
Ha: mean < 28.8
Pr(T < t) = 0.9910
Ha: mean != 28.8
Pr(|T| > |t|) = 0.0180
H0: E[X]  X0
H0: E[X]  X0
H0: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
Ha: E[X]  X0
p–value  0.991
p–value  0.018
cannot reject null at 5%
significance level as
pvalue  
reject null at 5%
significance level as
pvalue  
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
39
Ha: mean > 28.8
Pr(T > t) = 0.0090
p–value  0.009
reject null at 5%
significance level as
pvalue  
session one | page 25
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
key issue
inference at population level
► The second key problem we want to address is the following: based solely on the available sample, can we infer
anything about the range of the average age at the population level?
Remark The sample provides an estimator X for E[X], and we analyzed a way to test various relations between E[X]
and a pre-set benchmark X0. However the tests do not provide an idea on how far apart are X and E[X]. The aim is to
be able to make statements like: “with a given probability the interval [lb,ub] contains the true mean E[X]”.
key concept confidence intervals
► The interval, centered in the sample-based average, that would contain the true mean with a given probability
(confidence level) is referred to as a confidence interval.
► We already know that:
ttest 
X  E[ X ]
~ t  distribution
sX
► Then for any number x we can calculate then the following probability:


X  E[ X ]
Pr[  x  ttest  x ]  Pr  x 
 x   area between x and x under the t distribution density
sX


© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 26
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
“Direct” problem
► For a given x calculate Pr[ x  ttest  x]
This probability is the area between – x and x under the t distribution density.
Using the symmetry of the t distribution it’s easy to see that this area could be
calculated as
1  (area to the left of x)  (area to the right of x)
that is 1  (1  ttail(df,x))  (ttail(df,x))  1  2*ttail(df,x)
“Indirect” problem
► For a given A(0,1) find x such that Pr[ x  ttest  x]  1  A
Figure 22. “Direct” problem: Graphical solution
df = 39
x=2
1ttail(39,2)
(0.02625)
ttail(39,2)
(0.02625)
area between
2 and 2
is 0.9475
2
2
Figure 23. “Indirect” problem: Graphical solution
We can restate the “Indirect” problem as
 find x such that 1  (area to the left of x)  (area to the right of x)  1  A
But (area to the left of x)  (area to the right of x) and so the “Indirect” problem
becomes:
area to the
left of x
area between
x and x
is 1  A
 find x such that 1  2 (area to the right of x)  1  A
Finally:
 find x such that area to the right of x is A/2
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
area to the
right of x
x
x
session one | page 27
Managerial Economics &
Decision ►
MBA Demographics
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
formalizing hypothesis testing and confidence intervals
► The “Indirect” problem was reduced to solve for x such that area to the right of x is A/2
As a reminder, the STATA function ttail(df,x) provides the area to the right of x. Quite conveniently, STATA provides also the inverse
function invttail(df,area) that calculates the x for which the area to the right of x is exactly area.
► Using the STATA reserved function invttail(df,area) with area  A/2  0.025 we get the solution x  2.0226909 as
invttail(39,0.025)
Figure 24. “Indirect” problem: Graphical solution
► We can summarize the setup for the “Indirect” problem as follows:
for a given (0,1) solve for x such that Pr[ x  ttest  x]  1  .
► The solution is x  invttail(df,/2). You will find the standardized
notation tdf,/2  invttail(df,/2) thus x  tdf,/2.
area between
x and x
is 1  A  0.95
► Thus we know that x  ttest  x with probability 1   that is
 t df ,α / 2 
X  E[ X ]
 t df ,α / 2
sX
with probability 1  
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
area  0.025
area  0.025
2.022
2.022
–invttail(39,0.025)
invttail(39,0.025)
session one | page 28
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
► With a bit of algebra we can show that the statement  t df ,α / 2 
is equivalent to
X  s X  t df ,α / 2  E [ X ]  X  s X  t df ,α / 2
X  E[ X ]
 t df ,α / 2
sX
with probability 1  
with probability 1  
► This is exactly the confidence interval we were looking for. Since we already run the ttest command we do have all
information necessary to derive the interval:
Figure 25. Results of ttest command
Variable |
Obs
Mean
Std. Err.
Std. Dev.
[95% Conf. Interval]
---------+-------------------------------------------------------------------Age |
40
29.5125
.2884794
1.824504
28.929
30.096
-----------------------------------------------------------------------------mean  mean(Age)
t 
2.4698
Ho: mean  28.8
degrees of freedom 
39
X  29.51
sX  0.29
► For   5.00% we get tdf,/2  invttail(39,0.025)  2.0226909 and so the interval is
lb = 29.51  0.292.023  28.92333 and ub = 29.51  0.292.023  30.09667
The interval (28.92333, 30.09667) is an approximation as we used rounded values to simplify calculations.
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 29
Managerial Economics &
Decision ►
session one
statistical models: hypotheses, tests & confidence intervals
Developed for
business analytics II
a first look at statistical hypotheses and tests ◄
formalizing hypothesis testing and confidence intervals ◄
confidence intervals ◄
MBA Demographics
formalizing hypothesis testing and confidence intervals
We can obtain directly the confidence interval for the mean using:
ci means var, level(1)
Remark. For var you have to specify the name of the variable exactly as defined in the data
set (e.g., Age) while for level you should specify it as 95 when   5%, 90 when   10%, etc.
ci means Age, level(95)
Figure 26. Results of ci command for   5%
Variable |
Obs
Mean
Std. Err.
[95% Conf. Interval]
---------+--------------------------------------------------------------Age |
40
29.5125
.2884794
28.929
30.096
1    95
variable name
X  29.51
sX  0.29
confidence interval always
centered in the mean
quiz
Suppose we construct the confidence interval for   10%. Is this confidence interval narrow/wider than
the confidence interval constructed for   5%?
Answer
Narrower!
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session one | page 30