START WORKSHOP - STATISTICS
INSC 60010
STATISTICAL MODELS FOR MANAGERIAL
DECISIONS
START WORKSHOP - FALL 2013
Ranga Ramasesh
DISCUSSION TOPICS
1. Workshop Objectives
   1. Data Analysis – Motivations and Goals
   2. Preparatory Tools
2. Graphs for Exploratory Data Analysis
   1. Histograms / Frequency Distributions
   2. Scatter Plots
   3. Time Series Plots
3. Models of Uncertainty
   1. Sampling and Sampling Distributions
   2. Z-Distribution
   3. T-Distribution
4. Inference based on Sample data
   1. Confidence Interval Estimation
   2. Hypothesis Testing
5. Module Overview
   1. Introduction to Regression Analysis
   2. Syllabus and Administrative Details
NOTE
USE OF GRAPHS FOR EXPLORATORY DATA ANALYSIS
Example: Graduation Rates
Data on 142 colleges are available in a spreadsheet with filename prefix college06.
This information was obtained from the 2006 issue of U.S. News and World
Report and includes:
1. name of the college
2. graduation rate (GRATE)
3. freshman retention rate (FRESH)
4. percent of classes with fewer than 20 students (CLASS20)
5. percent of classes with more than 50 students (CLASS50)
6. percent of full time faculty (FTFAC)
7. 75th percentile of SAT scores (SAT75)
8. percent of incoming students in top 10% of high school class (TOP10)
9. acceptance rate (ARATE)
10. alumni giving rate (ALUM)
11. indicator for private school (1 = private; 0 = public) (PRIV)
A small sub-set of data is shown in the following table.
Managerial Concerns
1. How do we make sense of the graduation rates across the different colleges?
2. How do we understand the relationship, if any, between graduation rate
and the SAT scores?
Sample Data set
School Name                            GRATE  FRESH  CLAS20  CLAS50  FTFAC  SAT75  TOP10  ARATE  ALUM  PRIV
Harvard University                     0.98   0.97   0.70    0.13    0.92   1580   0.96   0.11   0.47  1
Princeton University                   0.97   0.98   0.74    0.11    0.91   1560   0.94   0.13   0.61  1
Yale University                        0.96   0.98   0.74    0.08    0.89   1560   0.95   0.10   0.46  1
University of Pennsylvania             0.94   0.98   0.75    0.07    0.88   1500   0.94   0.21   0.40  1
Duke University                        0.94   0.97   0.72    0.05    0.97   1530   0.87   0.24   0.45  1
Stanford University                    0.93   0.98   0.69    0.12    0.99   1550   0.87   0.13   0.38  1
California Institute of Technology     0.88   0.96   0.63    0.09    0.98   1570   0.93   0.21   0.32  1
Massachusetts Institute of Technology  0.92   0.98   0.61    0.16    0.91   1560   0.97   0.16   0.37  1
Columbia University                    0.93   0.98   0.69    0.10    0.91   1540   0.86   0.13   0.34  1
Dartmouth College                      0.95   0.97   0.61    0.10    0.93   1550   0.88   0.19   0.49  1
Washington University in St. Louis     0.92   0.97   0.74    0.08    0.92   1520   0.93   0.22   0.39  1
Northwestern University                0.92   0.97   0.73    0.08    0.93   1500   0.82   0.30   0.29  1
Cornell University                     0.92   0.96   0.44    0.22    0.99   1490   0.85   0.29   0.35  1
Johns Hopkins University               0.91   0.95   0.55    0.17    1.00   1490   0.80   0.30   0.33  1
Brown University                       0.96   0.97   0.65    0.12    0.94   1520   0.90   0.17   0.38  1
University of Chicago                  0.87   0.95   0.55    0.06    0.95   1530   0.82   0.40   0.29  1
Rice University                        0.91   0.96   0.60    0.10    0.93   1540   0.86   0.22   0.36  1
University of Notre Dame               0.96   0.98   0.56    0.10    0.85   1470   0.85   0.30   0.49  1
Vanderbilt University                  0.86   0.94   0.67    0.06    0.97   1440   0.77   0.38   0.28  1
Emory University                       0.86   0.94   0.67    0.07    0.95   1460   0.90   0.39   0.19  1
University of California - Berkeley    0.87   0.96   0.58    0.15    0.91   1450   0.99   0.25   0.15  0
Frequency Distributions and Histograms
Definitions:
A frequency distribution is a table that summarizes the numerical values of a
variable by recording the number of times (frequency) values fall within
certain ranges called classes or bins.
Definition: A histogram is a graph of a frequency distribution.
Using Excel
A frequency distribution can be constructed in Excel by choosing the “Data”
tab, and then choosing “Data Analysis” from the “Analysis” category and then
choosing “Histogram” (or by choosing “Histogram” from the “Data Analysis”
option on the “Tools” menu in earlier versions of Excel).
The variable examined in this example is the graduation rate for the
colleges in our sample. Bin limits are the upper inclusive values for each bin,
so the numbers shown in the Bin column of the Excel frequency distribution
are the upper limits of the bins. The bin limits chosen by Excel may be
somewhat awkward values to work with, so Excel also allows you to specify
the bin limits.
Note: If you don’t like the bin limits that Excel chooses (and I don’t), you can
choose your own. Below is another frequency distribution of
graduation rate using bin limits that I chose. The first (lower) bin limit is 0.20
with increments of 0.05 up to a maximum of 1.00. Excel will always add a last
row to the frequency distribution and label it “More”. You can delete it, as I
did here.
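The upper-inclusive bin convention described above can also be sketched outside Excel. The following is a minimal Python illustration, not Excel's own implementation; the function name `frequency_distribution` and the short list of graduation rates are hypothetical and are not taken from the college06 file.

```python
from bisect import bisect_left

def frequency_distribution(values, upper_limits):
    """Count values per bin, where each bin is identified by its
    upper inclusive limit (the convention described in the text)."""
    limits = sorted(upper_limits)
    # One extra slot at the end plays the role of Excel's "More" row.
    counts = [0] * (len(limits) + 1)
    for v in values:
        # bisect_left gives the first index i with limits[i] >= v,
        # so a value equal to a limit is counted in that limit's bin.
        counts[bisect_left(limits, v)] += 1
    return limits, counts

# Hypothetical graduation rates, for illustration only.
rates = [0.42, 0.55, 0.60, 0.61, 0.75, 0.78, 0.80, 0.93, 0.98]
limits, counts = frequency_distribution(rates, [0.50, 0.60, 0.70, 0.80, 0.90, 1.00])

for limit, count in zip(limits, counts):
    print(f"<= {limit:.2f}: {count}")
print(f"More: {counts[-1]}")
```

Note that the first bin has no lower limit in this sketch: every value at or below the smallest limit is counted there.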
Frequency Distribution
bins    Frequency
25%     1
30%     1
35%     0
40%     5
45%     4
50%     9
55%     14
60%     15
65%     6
70%     14
75%     21
80%     13
85%     11
90%     9
95%     14
100%    5
Histogram
This is the histogram (formatted) for the above frequency distribution.
[Figure: Histogram. Horizontal axis: bins, 25% to 100% in steps of 5%. Vertical axis: Frequency, 0 to 25.]
Scatterplots
Definition: A scatterplot is a plot showing the relationship between two
variables X and Y.
Suppose we want to examine the relationship between graduation rates and
SAT score. The following plot is the scatterplot of GRATE versus SAT75.
[Figure: Scatterplot of Graduation Rate versus SAT. Horizontal axis: SAT 75th Percentile, 800 to 1700. Vertical axis: Graduation Rate, 0.00 to 1.00. Fitted trendline: y = 0.0011x - 0.7635, R² = 0.7596.]
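The trendline reported on the scatterplot (y = 0.0011x - 0.7635, with x the SAT 75th percentile and y the graduation rate) can be used to read off a rough predicted graduation rate. The sketch below assumes only those two reported coefficients; the function name `predicted_grate` is hypothetical. Because the slope is rounded to two significant digits on the chart, the predictions are approximate.

```python
SLOPE = 0.0011       # rounded slope as reported on the chart
INTERCEPT = -0.7635  # intercept as reported on the chart

def predicted_grate(sat75):
    """Graduation rate predicted by the reported trendline."""
    return SLOPE * sat75 + INTERCEPT

for sat in (1000, 1250, 1500):
    print(sat, round(predicted_grate(sat), 3))
```

The reported R² of 0.7596 says that about 76% of the variation in graduation rate across colleges is accounted for by this straight-line relationship with SAT75.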
Time Series Plots
Time-series plot (Line Plot) of furniture sales in millions of dollars,
January 1992 through December 2007
Data file: Furnsales
[Figure: Line plot of FURNSALES. Horizontal axis: month number, 1 to 187 in steps of 6. Vertical axis: 0 to 14000.]
Art and Science of Graphical Presentations
Excellence in statistical graphics consists of complex ideas communicated
with clarity, precision, and efficiency. Graphical displays should
A. Show the data
B. Induce the viewer to think about the substance rather than about
methodology, graphic design, the technology of graphic production, or
something else
C. Avoid distorting what the data have to say
D. Present many numbers in a small space
E. Make large data sets coherent
F. Encourage the eye to compare different pieces of data
G. Reveal the data at several levels of detail, from a broad overview to the
fine structure
H. Serve a reasonably clear purpose: description, exploration, tabulation,
or decoration
I. Be closely integrated with the statistical and verbal descriptions of a
data set.
Graphics reveal data. Indeed graphics can be more precise and revealing than
conventional statistical computations.
From: The Visual Display of Quantitative Information by Edward R. Tufte,
Cheshire, Connecticut: Graphics Press 1983.
NOTE
STATISTICAL INFERENCE
Introduction
A common challenge faced by business managers is making judgments about the
characteristics of large populations. For example, managers in a large electronics retail
organization with several thousand stores all across the country may want to know the
average time per day spent by its salespersons with pseudo customers. (Pseudo customers
are those who come to retail stores to get help from salespeople in understanding and
comparing different products but stop short of buying anything in the store.) The managers
in this organization are interested in a characteristic of the entire group or the “population”
of the salespersons across all of its retail stores. In statistical terms, the managers are
interested in the numerical value of a “population parameter.” In this case, the
parameter they are interested in is the “population mean”.
Statistical Inference is the technique of making judgments about the unknown
“population parameters” of our interest such as a population mean based on appropriate
“sample statistics.” For example, the appropriate sample statistic to estimate an unknown
mean of a population is the sample mean. There are two distinct, but related approaches to
statistical inference. These are called “Estimation” and “Hypothesis Testing”.
Estimation
In some situations a manager or an analyst may not know what the numerical value
of the population parameter of interest is (or may not even have a tentative or claimed
value for the population parameter). For example, what is the average number of miles
driven by families during a summer vacation or what proportion of the eligible voters will
vote in favor of a particular candidate in an upcoming election? In such situations, the technique called
“Estimation” is used. Estimation deals with the determination of the numerical value of an
unknown population parameter, such as the population mean (i.e., the average value of a
specific measurement variable), using data from a random sample. The error that arises
from the fact that the estimates are based on the data from a relatively small sample is
quantified by what is called the “standard error of the sample estimate”. In general, we
estimate the population parameter by specifying an estimate (sometimes called the “point
estimate”) and a margin of error around it, i.e., by specifying a range of values called the
“confidence interval”. We multiply the standard error of the sample estimate by an
appropriate factor to give us the margin of error that results in confidence intervals in
which we have a specified level of confidence. The appropriate factor is based on the
distribution of the sample estimate.
Note: The confidence level is the proportion of times, over repeated samples, that our
estimation procedure produces an interval containing the true population parameter.
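The meaning of the confidence level can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with a known mean and standard deviation (the values 100 and 0.2 are chosen only for illustration), the interval uses a Z-based factor, and the function name `coverage` is hypothetical. It repeatedly draws samples, builds a 95% interval around each sample mean, and counts how often the interval contains the true mean.

```python
import random
from statistics import NormalDist, mean

def coverage(mu, sigma, n, confidence, trials, seed=0):
    """Fraction of simulated samples whose confidence interval contains mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    factor = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = factor * sigma / n ** 0.5  # factor times standard error
    hits = 0
    for _ in range(trials):
        sample_mean = mean(rng.gauss(mu, sigma) for _ in range(n))
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / trials

result = coverage(mu=100, sigma=0.2, n=12, confidence=0.95, trials=2000)
print(result)
```

The printed fraction should land close to 0.95: the procedure is "correct" in roughly 95 out of every 100 samples, even though any single interval either contains the true mean or does not.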
Hypothesis Testing
In some situations a manager or a business analyst may already have some idea
about the population parameter based on some intuition, or a claim made by others. For
example, a firm’s marketing department might claim that the average sales for a new model
of computer will be 500 units per week. The managerial concern here is whether to reject
the claim made and take an alternative position or not to reject the claim made and go with
it. In such situations, the “Hypothesis Testing” technique is used.
Recall that in the confidence interval approach, a manager has no idea about the
population parameter of interest (e.g., the population mean) and the manager constructs a
confidence interval using the sample estimate. The sample estimate is the value of an
appropriate statistic (i.e., the sample mean) based on a random sample of observations
drawn from the population.
In the hypothesis testing approach, a manager does have some specific numerical
value for the population parameter of interest. This value may come from a variety of
considerations. It may be the manager’s target value for the parameter or it may be a claim
made by someone. To test this claim, data from a random sample drawn from the
population is collected and it is used as evidence to test if it supports the claim or not.
An Informal Understanding of Hypothesis Testing
A good way to understand the hypothesis testing approach is to think of a criminal
trial. The jury’s concern here is the (unknown) truth about the defendant. There are two
positions: not guilty (defense’s position) or guilty (prosecution’s position). These are
mutually exclusive (i.e., non-overlapping) and collectively exhaustive (i.e., there are no
other positions). The jury must decide if it should (a) reject the defense’s position of “not
guilty” and go with the alternate position or (b) not reject it. How does the jury decide? It
looks at all the evidence presented and seeks an answer to the question: Does the evidence
enable us to reject the “not-guilty” position beyond reasonable doubt?
In an analogous way, in hypothesis testing we start with two mutually exclusive and
collectively exhaustive positions (or claims) about a population parameter. One of these is
called the null hypothesis and the other is called the alternate hypothesis. (We will
discuss the technicalities of how to designate the null and the alternate hypotheses later.)
Similar to the jury’s concern of whether to reject the defense’s “not-guilty” claim or not, our
concern is to make one of the following two decisions:
(a) Reject the Null Hypothesis and
(b) Do Not Reject the Null Hypothesis.
The jury considers a variety of evidence presented to it. In our case, the evidence is
simply the appropriate “test statistic” calculated from the data from a random sample.
How does the jury decide if there is evidence beyond reasonable doubt? For example, if the
prosecution establishes that the murderer was wearing blue jeans and points out that the
defendant owns a pair of blue jeans would it constitute evidence beyond reasonable doubt?
We would say: most likely not. Since many people own blue jeans, the likelihood or
probability of the evidence presented due to pure chance is quite high. There is nothing
extraordinary about this observation. It is not compelling enough. On the other hand, if the
prosecution presents evidence that the DNA of the defendant matches the DNA of the body
fluids found on the murder victim, we would say: most likely yes. The fundamental
consideration here is the likelihood or the chance associated with the evidence presented
or the sample outcome. The jury asks the question: “Assuming that the defendant is not
guilty, what is the probability that there is a DNA match?” Since the probability of a DNA
match is extremely small, the evidence that there is a DNA match is compelling enough to
reject our initial position that the defendant is not guilty. In a similar vein we first
tentatively assume that the null hypothesis is true. We then determine the probability of
getting a test statistic as extreme as (i.e., as large as or as small as) or more extreme than
the one we got. This probability is called the p-value of the test statistic. (We will
discuss how to calculate this probability, or “p-value”, later.)
We finally ask the question: Is this probability small enough for us to consider that it is
rather extreme or extraordinary? If it is, we feel compelled to reject our tentative
assumption that the claim (or the hypothesized value) contained in the null hypothesis is true. To
answer the above question, we must establish how low the p-value should be for us to
consider it low enough to reject the null hypothesis. This is analogous to
the issue of what is considered “beyond reasonable doubt” in the jury decision. What is the
threshold for the p-value to be considered significant? Analysts often use the rule that a
p-value below 5% can be viewed as “statistically significant”. In different settings different
levels of significance may be appropriate. In medical research, for example, a significance
level of 1% to 2% may be required to consider the results publication worthy. In some
business applications 5% or even 10% may be acceptable. In the hypothesis testing jargon
the threshold or cutoff percentage or probability is called “the significance level for the
hypothesis test”. It is denoted by the Greek letter α (alpha).
If the p-value is less than the significance level, we conclude that the sample
outcome is so extreme that it is very unlikely to have come from a population having the parameter value
specified in the null hypothesis. Hence we reject the null hypothesis and adopt the position
stipulated in the alternate hypothesis. This is analogous to the jury’s conclusion that the
probability of a DNA match is so small that it does not seem reasonable to continue to
hold the position that the defendant is not guilty and dismiss this evidence as a fluke.
Hence the jury will reject the null position and return the “guilty” verdict.
The choice of an appropriate value for α is a managerial policy decision.
Generally speaking, if the consequences of mistaking chance variation for a real discovery
are very bad, managers use a very strict cutoff (a low number like 1% for the significance
level). If the consequences are not serious, managers use a more lenient cutoff (like 5% or
10%). In medical research the consequences of this type of mistake may be serious because
an ineffective or dangerous vaccine may be approved by the FDA and given to millions of
people. The p-value is often quoted in research journals. A researcher may write something
like “we found that patients got better after taking our new drug (p-value < 0.01)” to mean
that benefits of the drug were found to be statistically significant with a p-value of less than
1%. The advice to the manager is not to worry so much about exactly which level of
significance to use. If the conclusions of your study change drastically if you adopt a slightly
different level of significance, then your results are probably close to the borderline of
statistical significance — and you should consider this when basing decisions on the study.
Finally, note that the jury decisions are not perfect. A jury sometimes reaches a
wrong verdict. Likewise, a hypothesis test may sometimes lead to an incorrect decision.
There are two types of errors:
(1) A jury may reject the “not guilty” position and wrongly convict a defendant who is
truly not guilty. Likewise, a hypothesis test might reject a null hypothesis which is
indeed true. This type of error is called a Type I error. It occurs if a hypothesis test
finds the sample evidence to be statistically significant when in reality it is due to
pure chance variation.
(2) A jury may fail to reject the “not guilty” position and wrongly let go a defendant
who is truly guilty. Likewise, a hypothesis test might fail to reject a null hypothesis
which is indeed false. This type of error is called a Type II error. It occurs if the
sample evidence is due to something beyond chance variation but the test does not
recognize it as statistically significant.
The significance level (α) represents the maximum probability of committing a
Type I error that is acceptable. If Type I errors are very costly you might use a smaller
value for α such as 0.01 or 0.001. For example, in testing the effectiveness of a new drug for
a common ailment, committing a Type I error means that an ineffective drug may be
prescribed to millions of people. This is very bad and therefore it is appropriate to use a
small value for α such as 0.001. On the other hand, in the case of an experimental treatment
for a fatal but otherwise incurable disease, a Type II error is probably worse than a Type I
error. Now a Type II error means that someone could miss out on the opportunity to be
cured, and a higher value for α such as 0.01 might be better. In fact, you might consider
giving this treatment to someone even if it has only been tested on a few people in the past;
waiting until you can establish the adequacy of the drug at the usual level of statistical
significance may be too conservative and patients may die in the meantime.
The maximum probability of committing a Type II error is often denoted by the
Greek letter β and (1- β) is called the power of a hypothesis test. In this course we will not
discuss the power of a test. But it is important for you to note that once you have chosen a
certain level of significance to control the probability of Type I error (i.e., chosen a value for
α), the probability of making a Type II error can be reduced only by increasing the sample size.
Let us recap and formally state the steps in Hypothesis testing.
Step 1: State the Null and Alternate Hypotheses
• Note that in any decision context, there will be a certain specific numerical value
claimed for the unknown population parameter µ. Let us denote it by µ0.
• There are three possible positions that you might take with respect to µ0:
1. µ ≠ µ0
2. µ > µ0
3. µ < µ0
• Corresponding to each of the above positions, the complementary positions are:
1. µ = µ0
2. µ ≤ µ0
3. µ ≥ µ0
• Always choose the position that has a strict inequality as the “Alternate Hypothesis”
(usually denoted by Ha). Conversely, the position which has an equality component
will be chosen as the “Null Hypothesis” (denoted by H0).
Thus, the three possible scenarios are:
1. H0: µ = µ0    Ha: µ ≠ µ0
2. H0: µ ≤ µ0    Ha: µ > µ0
3. H0: µ ≥ µ0    Ha: µ < µ0
Scenario 1 is called a “two-tailed” test. Scenarios 2 and 3 are “one-tailed” tests.
Step 2: Determine the Test Statistic (based on the sample data) or “Observed TS”
$$TS = \frac{\text{Sample Estimate} - \text{Value claimed in the Null Hypothesis}}{\text{Standard Error of the Estimate}}$$
Step 3: See if the Test Statistic is significant (Two approaches)
1. Find the Critical value of the Test Statistic corresponding to the significance level of the
hypothesis test and establish the Rejection Region.
The Critical Value of the Test Statistic for the specified significance level is found using
the distribution of the test statistic. The type of test determines the Rejection Region. In
a two-tailed test, for example, we want to know how large or how small the observed
test statistic must be for us to consider it extreme enough at the specified significance
level and hence reject the null hypothesis.
2. Find the P-value of the Observed Test Statistic
The p-value is the probability of observing a value for the test statistic as extreme as
(i.e., as large as or as small as) or more extreme than the one we observed under the
assumption that the null hypothesis is true. The p-value associated with the TS is found
using the distribution of the test statistic.
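Both approaches in Step 3 can be sketched in a few lines. The sketch below assumes that the test statistic follows the Z-distribution (the standard normal); the function names `critical_value` and `p_value` and the `tail` labels are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumption: the test statistic follows the Z-distribution

def critical_value(alpha, tail):
    """Cutoff that defines the rejection region; tail is 'two', 'upper', or 'lower'."""
    if tail == "two":
        return STD_NORMAL.inv_cdf(1 - alpha / 2)  # reject if |TS| exceeds this
    if tail == "upper":
        return STD_NORMAL.inv_cdf(1 - alpha)      # reject if TS exceeds this
    if tail == "lower":
        return STD_NORMAL.inv_cdf(alpha)          # reject if TS falls below this
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

def p_value(ts, tail):
    """Probability of a test statistic at least as extreme as ts, assuming H0 is true."""
    if tail == "two":
        return 2 * (1 - STD_NORMAL.cdf(abs(ts)))
    if tail == "upper":
        return 1 - STD_NORMAL.cdf(ts)
    if tail == "lower":
        return STD_NORMAL.cdf(ts)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

print(round(critical_value(0.05, "two"), 2))  # about 1.96
print(round(p_value(2.0, "two"), 4))          # about 0.0455
```

For a two-tailed test the significance level is split across both tails, which is why the cutoff uses α/2 and the p-value doubles the one-tail area.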
[Figures: rejection regions for the TWO-TAILED TEST and the ONE-TAILED TESTS]
Step 4: Make the Statistical Decision
1. Reject the Null if the Observed Test Statistic falls in the Rejection Region
2. Reject the Null if P-value is less than α-value
Step 5: State the Managerial Conclusion in plain English
Standard Deviation of the Sample Estimate
In the above discussion we used the standard error of sample estimate (a) to
establish a desired confidence interval for an unknown population parameter or (b) to test
a claim about the population parameter. How do we get this standard error of estimate? To
answer this question, we must understand the behavior of samples randomly drawn from
the population of interest in our study.
First, we will understand the behavior of the sample means ($\bar{X}$) when we take a
sample (of size n) from a population with a known population mean (µ). This behavior is
described by what is called the “sampling distribution of sample means”. It is the
foundation for statistical inference. Also, it has significant applications in statistical process
control.
Then we will learn the application of the concepts and the techniques of “Confidence
Interval Estimation” and “Hypothesis Testing” in the context of a case. We will limit our
focus to the estimation of unknown population mean and testing claims about it.
Important Note: Statistical Significance versus Practical Significance
Statistical significance doesn’t tell you whether or not the results are of practical
significance. Something is statistically significant if it is clearly more than just a chance
occurrence, whereas something is practically significant if it would have an important
impact on the business situation. If you gather enough data, even a tiny difference looks
statistically significant because then there is very little room for chance variation.
NOTE
SAMPLING DISTRIBUTION OF SAMPLE MEANS
A Thought Experiment
Before we get into the theoretical concepts, definitions, and analytical details, please
visualize playing a simple game and answer the questions that follow. In this game let us go
back to the case of a consumer products company like Procter & Gamble. This firm
manufactures liquid detergent, which is sold in 100-ounce plastic bottles. In the final stages
of the manufacturing process the 100-ounce plastic bottles are filled with liquid detergent
on an automated filling and packaging line.
The automatic bottle-filling machine used to fill the bottles is set to fill an average of
100 ounces of the detergent in each bottle. However, no machine is guaranteed to fill
exactly 100 ounces in each bottle. Rather, the fill amount varies from bottle to bottle. Thus,
some bottles will have slightly more than 100 ounces and some slightly less, although these
bottles will be labeled as “100 ounce” bottles. It has been established through extensive
data analysis that the variability in the “fill volume” is adequately represented by a normal
distribution with a mean of 100 ounces and a standard deviation of 0.2 ounce. The
company distributes shrink-wrapped bundles of one dozen (12) bottles to its retailers. Now
let us think of the average fill volume in a bundle of 12 bottles randomly selected from the
output of the machine.
1. Before we draw a 12-bottle bundle (i.e., a sample of 12 bottles), what do you expect the
“Average Fill Volume” of this sample of 12 bottles (call it the “Sample Mean”) to be?
_______________________________________________________________________________
2. We draw a bundle at random, i.e., a sample of 12 bottles, measure the volume of the 12
individual bottles, and find the sample average. Suppose this average value, i.e., the
sample mean is 98 ounces. Would you consider this to be significantly lower than what
you expect? How could you tell?
________________________________________________________________
Now let us formalize our thoughts and pick up some theoretical fundamentals.
Distribution of Sample Means
Consider a population of observations such as heights, weight, test scores, weekly
demands, salaries, and so on.
µ = Population Mean
σ = Population standard deviation
We draw random samples of n observations. In statistical jargon, we say
Sample size = n
The sample mean is a random variable. It varies from one sample to another. To work with
sample means, i.e. to make judgments about the sample means or to use sample means to
make judgments about the populations, we must know the distribution of the sample
means. This distribution is called the sampling distribution of sample means. Recall that a
distribution must specify three things: the central measure or the mean, the variability
around the mean or the standard deviation, and the shape of the distribution.
From the Central Limit Theorem in Statistics, we have the following result:
Mean or the expected value of the distribution of sample means $\bar{X}$:
$$E(\bar{X}) = \mu$$
The standard deviation of the distribution of sample means $\bar{X}$:
$$SD_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
The distribution of sample means follows the Normal Model.
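As a worked illustration, the result can be applied to the bottle-filling thought experiment (fill volume normal with mean 100 ounces and standard deviation 0.2 ounce, bundles of 12). The sketch assumes that the sample mean is normal with mean µ and standard deviation σ/√n; the variable names are chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma, n = 100.0, 0.2, 12  # values from the thought experiment

# Standard deviation of the sample mean: sigma / sqrt(n)
sd_of_mean = sigma / n ** 0.5
sampling_dist = NormalDist(mu=mu, sigma=sd_of_mean)

# How many standard deviations below the expected value is a sample mean of 98?
z = (98.0 - mu) / sd_of_mean
prob_below_98 = sampling_dist.cdf(98.0)

print(round(sd_of_mean, 4))  # about 0.0577 ounce
print(round(z, 1))           # about -34.6
print(prob_below_98)         # effectively zero
```

The expected sample mean is 100 ounces, and a 12-bottle average of 98 ounces would sit more than 34 standard deviations below it. Under the stated model such an outcome is essentially impossible by chance alone, which suggests the machine is not filling as assumed.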
The following conditions must be satisfied for the above result to hold:
1. Randomization Condition
• The sampling method must be unbiased and representative of the population.
2. 10% Condition
• The sample size, n, must be no more than 10% of the population size, N. If it is larger, then a
“Finite Population Correction Factor” must be applied to the standard error.
3. Nearly Normal Condition and Sample Size Requirement
• The data must come from a distribution that is unimodal and approximately symmetric.
• If the data distribution is known to be normal, then any sample size is OK.
• If the data distribution is not known to be normal, then the sample size must be ≥ 30.
• The approximation to the normal distribution will become closer as the sample size
increases. If the parent distribution is symmetric, smaller samples are adequate than if the
parent population is skewed or long-tailed. Symmetry of the parent distribution is
particularly important.
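The 10% Condition can be sketched as a small helper. The notes do not give the formula for the Finite Population Correction Factor; the sketch assumes the usual textbook form, sqrt((N - n) / (N - 1)), and the function name `standard_error` is hypothetical.

```python
def standard_error(sigma, n, population_size=None):
    """Standard error of the sample mean, with the finite population
    correction applied when n exceeds 10% of the population size."""
    se = sigma / n ** 0.5
    if population_size is not None and n > 0.10 * population_size:
        # Assumed textbook form of the correction factor.
        se *= ((population_size - n) / (population_size - 1)) ** 0.5
    return se

print(standard_error(10, 100))                         # no population size given
print(standard_error(10, 100, population_size=10000))  # n is 1% of N: no correction
print(standard_error(10, 100, population_size=500))    # n is 20% of N: corrected
```

The correction always shrinks the standard error, because sampling a large share of a finite population leaves less room for the sample mean to differ from the population mean.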
Example:
In 2008, the average salary for federal workers whose occupations also exist in the private
sector was $67,691. By contrast, the average salary for employees working in similar jobs
in the private sector in 2008 was $60,046. Assume that the population standard deviation
of the salaries of federal workers is $15,300. A random sample of 34 employees is selected.
a) What is the probability that the sample mean will be less than $64000?
b) What is the probability that the sample mean will be more than $70000?
c) What is the probability that the sample mean will be less than $60046?
Analysis and Solution
First, we make sure that the conditions are satisfied. In this case, we have (1) a random
sample, (2) a sample size ≥ 30, and (3) reason to believe that the 10% condition is
satisfied.
We are interested in finding probabilities about the sample mean, i.e., the average salary in
the sample of federal workers whose occupations also exist in the private sector. Let us call it $\bar{X}$.
We must determine the distribution of $\bar{X}$. The distribution of $\bar{X}$ is Normal with
Mean = Population mean = 67691
$$\text{Standard Deviation} = SD_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15300}{\sqrt{34}} = 2623.93$$
Now, we can find the desired probabilities using the NORM.DIST function in EXCEL.
In fact, we can do all the computation in an EXCEL worksheet.
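For readers working outside Excel, the same three probabilities can be sketched in Python. This is an illustration under the assumptions of the example (sample mean normal with mean 67691 and standard deviation σ/√n); `cdf(x)` here plays the role of Excel's cumulative `NORM.DIST(x, mean, standard_dev, TRUE)`, and the variable names are chosen for illustration.

```python
from statistics import NormalDist

mu, sigma, n = 67691, 15300, 34  # values given in the example

sd_of_mean = sigma / n ** 0.5    # standard deviation of the sample mean
x_bar = NormalDist(mu=mu, sigma=sd_of_mean)

p_a = x_bar.cdf(64000)        # a) P(sample mean < 64000)
p_b = 1 - x_bar.cdf(70000)    # b) P(sample mean > 70000)
p_c = x_bar.cdf(60046)        # c) P(sample mean < 60046)

print(f"SD of sample mean: {sd_of_mean:.2f}")
print(f"a) {p_a:.4f}   b) {p_b:.4f}   c) {p_c:.4f}")
```

Under these inputs the three probabilities come out near 0.08, 0.19, and 0.002. Part (c) shows that a sample of 34 federal workers averaging below the private-sector figure of $60,046 would be quite unusual.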
NOTE
ESTIMATION OF AN UNKNOWN POPULATION MEAN
Guided Case Analysis: Retail Store Operation
ELCO is a chain of stores that sells consumer electronics in the US. ELCO suspects that the
main reasons for declining profits are the falling quality of service and growing
competition. Managers at ELCO want to know the average time that a salesperson spends
with customers. They are worried about pseudo customers who take up a salesperson’s
time to get details about a product but then make their purchases elsewhere. ELCO
managers are concerned about the average time spent by the sales persons with pseudo
customers across the entire population of sales persons in the company.
Specifically, ELCO managers are interested in estimating the population average time spent
with pseudo customers by the company’s salespersons. They want 95% confidence in
their estimate.
ELCO has collected data on the service time spent with pseudo customers in a day from a
random (i.e. representative) sample of 100 salespersons. The data set is given in the
following table. See EXCEL file service.
Data Set
Time spent by salespersons with pseudo customers.
Service time in seconds
 3897   6743   6692   5301   2466   5702   4973   3482   5456   6981
 3589   4320   1245    562   6824   9010   8910   1003   8821   5797
 6712   1349   4239   2134   4687   1688   8904   3099    921   5817
 6984   2485   8901   4111   8903   8933   6986   7133   2349   9042
 7120   4713   4344   5921   1471   7432   7059   8425   7027   5479
 6934   7234   1358   2302   8324   2309   2329   7912   2399   4456
 7632  11921   1357   5691   3216   4865   9249   8349   3369   4771
 9214   5578   2316   1279   3130   5892   3870   2390   3190   7243
 2390   2891   8238   4349   1208   3999   4389   2348   5681   3123
 4992   3356   1217   1109   5002   4006   1730   2100   2305   7349
Analysis and Solution
The variable of our interest here is “the time spent by salespersons with pseudo
customers.” Let us denote it by the symbol X. X is a random variable because its value
changes across salespersons and days of operation.
We are interested in “the average time spent with pseudo customers by the entire population
of the company’s salespersons”. This is the population parameter, i.e., the
population mean (denoted by µ). The numerical value of µ is unknown.
We estimate the value of an unknown population parameter using data from a
representative sample and determining the appropriate sample statistic. In this case the
appropriate sample statistic is the sample mean. We call it the sample estimate.
Obviously the sample estimate is not perfect in the sense that we cannot be 100% confident
that the population mean is exactly equal to the sample estimate. There is the inevitability
of sampling error. We must recognize a margin of error surrounding our sample estimate
that is based on (a) the variability in our sampling process and (b) the level of confidence
we desire. Therefore, we estimate the unknown population mean not by a single number
(i.e., the sample estimate) but by an interval called the “Confidence Interval”.
A confidence interval for the population mean is given by
Sample Estimate ± [Margin of Error]
The margin of error is a product of two components:
Margin of Error = Confidence Factor x Standard Deviation of the Sample Estimate
In the present case, the Sample Estimate = Sample Mean “=AVERAGE(Data)”
The standard deviation of the estimate (i.e., the sample mean) is given by the formula
$SD_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
In this formula we know that $n = 100$, but we do not know $\sigma$, the population
standard deviation. (In fact, except in some rare situations, $\sigma$ is usually unknown.) So how
do we proceed? Statisticians have offered a solution to this problem. Recognize that $\sigma$ is
essentially a measure of the variability in the population. If we don’t know its numerical value,
the best we can do is to substitute the measure of variability in the sample that is
representative of the population. This measure is the sample standard deviation (denoted
by s). We can easily calculate (or let EXCEL calculate) the numerical value of the sample
standard deviation from the sample data. But this substitution comes with some
adjustments.
First, we use a slightly different terminology. Instead of using the term “Standard Deviation
of the sample estimate” we use the term “Standard Error of the sample estimate”. We use
the notation $SE_{\bar{X}}$.
The standard error of the estimate for the mean is given by the following formula:
$SE_{\bar{X}} = \dfrac{s}{\sqrt{n}}$
We can compute the standard error easily.
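The standard error formula can be sketched in a few lines of Python. This is only an illustration, not part of the EXCEL workflow used in the workshop: the helper name `standard_error` is hypothetical, the first call uses just the first ten service times from the data set to show the mechanics, and the full-sample values (s = 2610.622, n = 100) are the ones reported in the hypothesis-testing worksheet later in these notes.

```python
import math
import statistics


def standard_error(sample):
    """Return s / sqrt(n), where s uses the (n - 1) divisor like EXCEL's STDEV.S."""
    s = statistics.stdev(sample)
    return s / math.sqrt(len(sample))


# First ten service times from the data set, used only to illustrate the call.
first_ten = [3897, 6743, 6692, 5301, 2466, 5702, 4973, 3482, 5456, 6981]
se_first_ten = standard_error(first_ten)

# Full-sample summary values as reported in the workshop worksheet.
s_full, n_full = 2610.622, 100
se_full = s_full / math.sqrt(n_full)

print(f"SE from the first ten observations: {se_first_ten:.1f} seconds")
print(f"SE from the full-sample summary:    {se_full:.4f} seconds")
```

Because the standard error shrinks with the square root of n, a sample of ten observations would typically give a much wider margin of error than the full sample of 100.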
Second, and more important, what about the confidence factor? If we use the above formula,
the appropriate sampling distribution will no longer be the normal distribution that we
used in the previous discussion. This means that we cannot find the confidence factors
using the Z-distribution functions. We must use a slightly different distribution called the
T-distribution. But it is not a big deal!
What it means is just this: to find the value of the confidence factor we should use
the T-distribution rather than the Z-distribution. Using T-distribution functions is very
much like using the normal and standard normal distribution functions, although the way
EXCEL’s T-distribution functions work is somewhat different. We must learn how to use
the T-distribution functions in EXCEL.
A key difference between the two distributions is this: there is a unique
standard normal distribution, no matter what the sample size is, whereas the T-distribution
depends on the sample size. More precisely, it depends on (n−1), which is usually called the
degrees of freedom. A T-distribution resembles the Z-distribution but has thicker ‘tails’. A
T-distribution with large degrees of freedom closely resembles the standard normal
distribution.
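The "thicker tails" remark can be checked numerically. The following Python sketch is an illustration only: it compares the height of the T-distribution curve with the height of the standard normal curve at a point far out in the tail (x = 3, an arbitrary choice), for a few arbitrarily chosen degrees of freedom. The function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Height of the standard normal curve at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def t_pdf(x, df):
    """Height of the T-distribution curve with df degrees of freedom at x."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


x = 3.0  # a point well out in the right tail (arbitrary choice)
tail_ratios = {df: t_pdf(x, df) / normal_pdf(x) for df in (5, 30, 99)}

for df, ratio in tail_ratios.items():
    print(f"df = {df:3d}: T curve is {ratio:.2f} times as high as the Z curve at x = {x}")
```

The ratio stays above 1 but moves toward 1 as the degrees of freedom grow, which is the sense in which a T-distribution with large degrees of freedom resembles the standard normal distribution.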
We will now learn how to use the EXCEL functions related to the T-distribution.
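T.INV.2T function: T.INV.2T(probability, deg_freedom) returns the positive T-value for which
the total area in the two tails equals the given probability.
T.DIST.2T function: T.DIST.2T(x, deg_freedom) returns the total area in the two tails beyond
±x (x must not be negative).
T.INV function: T.INV(probability, deg_freedom) returns the T-value for which the area to the
LEFT equals the given probability.
T.DIST function: T.DIST(x, deg_freedom, TRUE) returns the area to the LEFT of x.
T.DIST.RT function: T.DIST.RT(x, deg_freedom) returns the area to the RIGHT of x.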
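For readers who want to see what the five EXCEL functions compute, here is a rough Python sketch of comparable calculations. It is not EXCEL's own algorithm: it integrates the T-distribution curve numerically with Simpson's rule and inverts it by bisection, and the function names and the search range of ±50 are assumptions made for this illustration (adequate for moderate degrees of freedom and probabilities not extremely close to 0 or 1).

```python
import math


def t_pdf(x, df):
    """Height of the T-distribution curve with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_dist(x, df, steps=2000):
    """Area to the LEFT of x, comparable to T.DIST(x, df, TRUE)."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0              # area between 0 and |x|
    return 0.5 + area if x > 0 else 0.5 - area


def t_dist_rt(x, df):
    """Area to the RIGHT of x, comparable to T.DIST.RT(x, df)."""
    return 1.0 - t_dist(x, df)


def t_dist_2t(x, df):
    """Total area in both tails beyond +/-x, comparable to T.DIST.2T(x, df)."""
    return 2.0 * t_dist_rt(abs(x), df)


def t_inv(p, df, lo=-50.0, hi=50.0):
    """T-value with area p to its LEFT, comparable to T.INV(p, df)."""
    for _ in range(60):                 # bisection on the left-tail area
        mid = (lo + hi) / 2.0
        if t_dist(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def t_inv_2t(p, df):
    """Positive T-value with total two-tail area p, comparable to T.INV.2T(p, df)."""
    return t_inv(1.0 - p / 2.0, df)


print(f"Two-tailed 5% factor, 99 df:  {t_inv_2t(0.05, 99):.5f}")
print(f"Two-tail area beyond 2.145197: {t_dist_2t(2.145197, 99):.6f}")
```

The two printed quantities can be compared with the values these notes obtain from EXCEL for the same inputs (1.98421 and 0.034383).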
Back to the Case: How to Find the 95% Confidence Factor?
For a confidence level of 95%, the T-value must be such that 5% of the area (equally divided
between the two tails) under the T-distribution curve falls outside ± this value.
Visualization:
Finding the desired T-value:
Use the T.INV.2T function, enter 0.05 for Probability and (100 − 1) or 99 for Degrees of
Freedom, or enter “=T.INV.2T(0.05, 99)” in any cell.
We get the desired confidence factor = 1.98421
Now we have all three pieces required to build the 95% Confidence Interval.
Sample Estimate ± Confidence Factor x Standard Error of the Sample Estimate
With the sample mean of 4880.03 seconds and the standard error of 261.0622 seconds computed
from the data in EXCEL, this gives 4880.03 ± 1.98421 x 261.0622 = 4880.03 ± 518.00 seconds.
Answer: A 95% confidence interval for the population mean time spent by salespersons with
pseudo customers is: [4362 seconds, 5398 seconds]. We are 95% confident that this
interval contains the true population mean.
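The arithmetic behind this interval can be reproduced with a short Python sketch. It is an illustration only: the sample mean and sample standard deviation are the values reported in the workshop's EXCEL worksheet for the service data, and the confidence factor is the T.INV.2T result quoted above; the variable names are hypothetical.

```python
import math

sample_mean = 4880.03        # =AVERAGE(data), value reported in the worksheet
sample_sd = 2610.622         # =STDEV.S(data), value reported in the worksheet
n = 100                      # sample size
confidence_factor = 1.98421  # =T.INV.2T(0.05, 99), as quoted in the notes

standard_error = sample_sd / math.sqrt(n)
margin_of_error = confidence_factor * standard_error

lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"Margin of error: {margin_of_error:.2f} seconds")
print(f"95% confidence interval: [{lower:.0f}, {upper:.0f}] seconds")
```

Changing the confidence level only changes the confidence factor; the sample mean and the standard error stay the same.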
NOTE
HYPOTHESIS TESTING - UNKNOWN POPULATION MEAN
Guided Case Analysis: Retail Store Operation
ELCO is a chain of stores that sells consumer electronics in the US. ELCO suspects that the
main reasons for declining profits are the falling quality of service and growing
competition. Managers at ELCO want to know the average time that a salesperson spends
with customers. They are worried about pseudo customers who take up a salesperson’s
time to get details about a product but then make their purchases elsewhere. ELCO
managers are concerned about the average time spent by the salespersons with pseudo
customers across the entire population of salespersons in the company.
The manager of the Human Resources Department has claimed that the average time spent by
salespersons with pseudo customers is equal to 15% of an 8-hour work day, or 4320
seconds. Senior managers at ELCO are wondering how they should react to the claim made
by the HR Department manager: should they reject the claim or not? They have decided
that any decision to reject the claim must be made at a significance level of 1 percent, or 0.01.
ELCO has collected data on a simple random sample of 100 observations (service time
spent with pseudo customers in a day by 100 salespersons). The data set is in the EXCEL
file service.
Analysis
We first realize that this is a hypothesis test of a claim made about the unknown population
mean.
Step 1: Statement of the Hypotheses
H0: µ = 4320
Ha: µ ≠ 4320
It is a two-tailed test. Significance level α = 0.01
Step 2: Test Statistic
$TS = \dfrac{\text{Sample Estimate} - \text{Value claimed in the Null Hypothesis}}{\text{Standard Error of the Estimate}}$
This Test Statistic follows a T-distribution with (n-1) = (100-1) = 99 degrees of freedom.
Step 3: Decision Criteria
1. Critical Values of the Test Statistic and the Rejection Region
2. P-value
Step 4: Statistical Decision
Step 5: Managerial Conclusion
Let us do the computations in EXCEL and complete the above steps.
EXCEL Worksheet: Hypothesis Testing of Population Mean - Two-tailed Test
Quantity | EXCEL formula or definition | Value
Claimed value of the Population Mean, µ0 | (given) | 4320
Type of test | Direction of alternate hypothesis: ≠ | 2-Tail
Significance Level, α | (given) | 0.01
Confidence Level, (1-α) | 1 - α | 0.99
Sample size, n | =COUNT(A2:A101) | 100
Sample Estimate = Sample Mean | =AVERAGE(A2:A101) | 4880.03
Sample Standard Deviation, s | =STDEV.S(A2:A101) | 2610.622
Standard Error of the Sample Estimate, SE | Sample Standard Deviation / SQRT(n) | 261.0622
Observed Sample Test Statistic | T = (Observed value - Claimed value) / Standard Error of the Estimate | 2.145197
Degrees of Freedom, df | (n-1) | 99
Critical value of Test Statistic for the level of significance | T* = T.INV.2T(α, df) | 2.626405
P-value of the Observed Sample Test Statistic = the area in the two tails beyond the Absolute Value of the Observed Test Statistic | =T.DIST.2T(Observed Sample Test Statistic, df) | 0.034383
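The worksheet computations can be cross-checked outside EXCEL. The Python sketch below is an illustration under stated assumptions: it takes the sample mean, sample standard deviation and critical value as reported in the worksheet, recomputes the test statistic, and approximates the two-tailed P-value by numerically integrating the T-distribution curve with Simpson's rule (this is not EXCEL's algorithm, and the function names are hypothetical).

```python
import math


def t_pdf(x, df):
    """Height of the T-distribution curve with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tail_area(t, df, steps=2000):
    """Area in both tails beyond +/-|t|, comparable to T.DIST.2T(|t|, df)."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2.0 * total * h / 3.0     # area between -|t| and +|t|
    return 1.0 - central


mu0, alpha = 4320, 0.01                 # claimed mean and significance level
sample_mean, sample_sd, n = 4880.03, 2610.622, 100   # worksheet values
df = n - 1
t_critical = 2.626405                   # T.INV.2T(0.01, 99), from the worksheet

se = sample_sd / math.sqrt(n)
t_observed = (sample_mean - mu0) / se
p_value = two_tail_area(t_observed, df)

reject_by_critical_value = abs(t_observed) > t_critical
reject_by_p_value = p_value < alpha

print(f"Observed test statistic: {t_observed:.6f}")
print(f"Two-tailed P-value:      {p_value:.6f}")
print(f"Reject H0 at alpha = {alpha}? {reject_by_critical_value}")
```

With the worksheet's values, the observed statistic lies inside the critical values and the P-value exceeds α, so both decision criteria point to not rejecting H0 at the 1% level. The two criteria always agree, because they are two views of the same comparison.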
Appendix 1: Normal Distribution
The normal distribution is a continuous probability distribution useful in
describing many real-world situations. It is also a very important
distribution in statistical applications. There is actually a family of normal
distributions, with each distribution completely specified by the values of
two parameters, the mean, µ, and the standard deviation, σ.
Every normal distribution is symmetric and centered at its mean. The
standard deviation determines how spread out the values in the
distribution are.
Although the limits of any normal distribution are, in theory, ± ∞, 99.7% of
the values are within ± 3σ of µ.
Probability in a normal distribution is determined as the area under the
normal curve.
The total area (total probability) under the normal curve is 1.
The Standard Normal Distribution
If X is normal with mean µ and standard deviation σ, then
$Z = \dfrac{X - \mu}{\sigma}$
is called standard normal. A probability statement about any normal
random variable X can be transformed into an equivalent probability
statement about the standard normal random variable Z. The Z-distribution has:
Mean: μ = 0
Standard Deviation: σ = 1.0
Z will always be used to represent a standard normal random variable.
Probabilities under the standard normal curve have been tabulated and are
shown in a table of standard normal probabilities.
Note that P(Z ≤ z) = P(Z < z) (including or excluding a single number does
not change the probability)
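The transformation to Z and the "99.7% within ± 3σ" statement can both be verified with Python's standard library. This is a minimal sketch with hypothetical numbers: the mean of 50, standard deviation of 10 and the value x = 65 are assumptions chosen only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()                  # standard normal: mean 0, standard deviation 1
X = NormalDist(mu=50, sigma=10)   # hypothetical normal variable (assumed values)

x = 65
z = (x - X.mean) / X.stdev        # standardize: Z = (X - mu) / sigma

p_via_x = X.cdf(x)                # P(X <= 65) computed directly
p_via_z = Z.cdf(z)                # P(Z <= 1.5) computed after standardizing

within_3_sigma = Z.cdf(3) - Z.cdf(-3)   # area within 3 standard deviations

print(f"z-value for x = {x}: {z}")
print(f"P(X <= {x}) = {p_via_x:.4f}, P(Z <= {z}) = {p_via_z:.4f}")
print(f"Area within +/- 3 sigma: {within_3_sigma:.4f}")
```

The two probabilities agree, which is why a single table of standard normal probabilities is enough for every normal distribution.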
Cumulative Standard Normal (CSN) Table: P(Z < z)
The row gives z to one decimal place; the column gives the second decimal place of z.
z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09
0.0 | 0.5000 | 0.5040 | 0.5080 | 0.5120 | 0.5160 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359
0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753
0.2 | 0.5793 | 0.5832 | 0.5871 | 0.5910 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141
0.3 | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517
0.4 | 0.6554 | 0.6591 | 0.6628 | 0.6664 | 0.6700 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879
0.5 | 0.6915 | 0.6950 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.7190 | 0.7224
0.6 | 0.7257 | 0.7291 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549
0.7 | 0.7580 | 0.7611 | 0.7642 | 0.7673 | 0.7704 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852
0.8 | 0.7881 | 0.7910 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133
0.9 | 0.8159 | 0.8186 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.8340 | 0.8365 | 0.8389
1.0 | 0.8413 | 0.8438 | 0.8461 | 0.8485 | 0.8508 | 0.8531 | 0.8554 | 0.8577 | 0.8599 | 0.8621
1.1 | 0.8643 | 0.8665 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830
1.2 | 0.8849 | 0.8869 | 0.8888 | 0.8907 | 0.8925 | 0.8944 | 0.8962 | 0.8980 | 0.8997 | 0.9015
1.3 | 0.9032 | 0.9049 | 0.9066 | 0.9082 | 0.9099 | 0.9115 | 0.9131 | 0.9147 | 0.9162 | 0.9177
1.4 | 0.9192 | 0.9207 | 0.9222 | 0.9236 | 0.9251 | 0.9265 | 0.9279 | 0.9292 | 0.9306 | 0.9319
1.5 | 0.9332 | 0.9345 | 0.9357 | 0.9370 | 0.9382 | 0.9394 | 0.9406 | 0.9418 | 0.9429 | 0.9441
1.6 | 0.9452 | 0.9463 | 0.9474 | 0.9484 | 0.9495 | 0.9505 | 0.9515 | 0.9525 | 0.9535 | 0.9545
1.7 | 0.9554 | 0.9564 | 0.9573 | 0.9582 | 0.9591 | 0.9599 | 0.9608 | 0.9616 | 0.9625 | 0.9633
1.8 | 0.9641 | 0.9649 | 0.9656 | 0.9664 | 0.9671 | 0.9678 | 0.9686 | 0.9693 | 0.9699 | 0.9706
1.9 | 0.9713 | 0.9719 | 0.9726 | 0.9732 | 0.9738 | 0.9744 | 0.9750 | 0.9756 | 0.9761 | 0.9767
2.0 | 0.9772 | 0.9778 | 0.9783 | 0.9788 | 0.9793 | 0.9798 | 0.9803 | 0.9808 | 0.9812 | 0.9817
2.1 | 0.9821 | 0.9826 | 0.9830 | 0.9834 | 0.9838 | 0.9842 | 0.9846 | 0.9850 | 0.9854 | 0.9857
2.2 | 0.9861 | 0.9864 | 0.9868 | 0.9871 | 0.9875 | 0.9878 | 0.9881 | 0.9884 | 0.9887 | 0.9890
2.3 | 0.9893 | 0.9896 | 0.9898 | 0.9901 | 0.9904 | 0.9906 | 0.9909 | 0.9911 | 0.9913 | 0.9916
2.4 | 0.9918 | 0.9920 | 0.9922 | 0.9925 | 0.9927 | 0.9929 | 0.9931 | 0.9932 | 0.9934 | 0.9936
2.5 | 0.9938 | 0.9940 | 0.9941 | 0.9943 | 0.9945 | 0.9946 | 0.9948 | 0.9949 | 0.9951 | 0.9952
2.6 | 0.9953 | 0.9955 | 0.9956 | 0.9957 | 0.9959 | 0.9960 | 0.9961 | 0.9962 | 0.9963 | 0.9964
2.7 | 0.9965 | 0.9966 | 0.9967 | 0.9968 | 0.9969 | 0.9970 | 0.9971 | 0.9972 | 0.9973 | 0.9974
2.8 | 0.9974 | 0.9975 | 0.9976 | 0.9977 | 0.9977 | 0.9978 | 0.9979 | 0.9979 | 0.9980 | 0.9981
2.9 | 0.9981 | 0.9982 | 0.9982 | 0.9983 | 0.9984 | 0.9984 | 0.9985 | 0.9985 | 0.9986 | 0.9986
3.0 | 0.9987 | 0.9987 | 0.9987 | 0.9988 | 0.9988 | 0.9989 | 0.9989 | 0.9989 | 0.9990 | 0.9990
3.1 | 0.9990 | 0.9991 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.9992 | 0.9992 | 0.9993 | 0.9993
3.2 | 0.9993 | 0.9993 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9995 | 0.9995 | 0.9995
3.3 | 0.9995 | 0.9995 | 0.9995 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9997
3.4 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9998
3.5 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998
3.6 | 0.9998 | 0.9998 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999
3.7 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999
3.8 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999
3.9 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
4.0 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Examples Using the Standard Normal Table
a. Find P(Z < 1.00)
b. Find P(0 < Z < 1)
c. Find P(-1.3 < Z < 2.0)
d. Find P(-1.57 < Z < -0.82)
e. Find P(Z < -2.53)
f. Find P(Z < -6)
g. Find z so that a probability of 5% falls above (to the right of) that value:
P(Z > z) = 0.05.
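Table lookups for examples a through g can be double-checked with Python's standard library. This is a sketch for self-checking, not a substitute for practicing with the table; the dictionary name `answers` is hypothetical. Each interval probability is computed as a difference of two left-tail areas, exactly as one would do with the CSN table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

answers = {
    "a": Z.cdf(1.00),                   # P(Z < 1.00)
    "b": Z.cdf(1) - Z.cdf(0),           # P(0 < Z < 1)
    "c": Z.cdf(2.0) - Z.cdf(-1.3),      # P(-1.3 < Z < 2.0)
    "d": Z.cdf(-0.82) - Z.cdf(-1.57),   # P(-1.57 < Z < -0.82)
    "e": Z.cdf(-2.53),                  # P(Z < -2.53)
    "f": Z.cdf(-6),                     # P(Z < -6), essentially zero
    "g": Z.inv_cdf(0.95),               # z with 5% of the area to its right
}

for label, value in answers.items():
    print(f"{label}. {value:.4f}")
```

For negative z-values the table is used through symmetry, P(Z < -z) = 1 - P(Z < z); the code handles negative inputs directly.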
Using EXCEL to Find the Desired Probabilities
The following example provides the context for illustrating how to use EXCEL to find the
probabilities of interest.
Example:
Each year thousands of high school students take the Scholastic Aptitude Test (SAT). The
distribution of the scores on each SAT is approximately unimodal and symmetric, and it is
well described by a Normal model with a mean of 500 and a standard deviation of 100.
1. Suppose a student scored 600 on an SAT test. Where does this student stand among all the
students that took this SAT?
2. What proportion of students scored between 450 and 600 on this SAT?
3. Suppose a college says it accepts only students with SAT scores among the top 10%. How
high should a student’s SAT score be in order to be accepted at this college?
Solution Procedure
First, visualize the situation by drawing a picture of the Normal distribution and marking
out the desired probability (i.e., the area under the normal curve).
To find areas (or probabilities) we use the NORM.DIST function in EXCEL.
The function NORM.DIST(X, mean, std-dev, TRUE) gives the area under the normal curve
to the LEFT of the value that you input for “X”.
You may follow one of two alternative approaches:
1. In any EXCEL cell enter the formula “=NORM.DIST(X, mean, std-dev, TRUE)” with
appropriate numerical values and then hit “enter”
2. With the cursor on any EXCEL cell, click on the built-in function button and choose
NORM.DIST from the menu of “Statistical” functions. This will open the dialog box
and you can fill the appropriate numerical values.
Answer to Question 1
Visualization
Entering “=NORM.DIST(600, 500, 100, TRUE)” in an EXCEL cell gives 0.8413.
Answer: A score of 600 places the student above about 84% of all the students who took this SAT.
Answer to Question 2
Visualization
The desired area is the difference between two areas. We can find these separately and
then do the subtraction. Or we can directly enter the formula that represents the
subtraction, “=NORM.DIST(600, 500, 100, TRUE) - NORM.DIST(450, 500, 100, TRUE)”, into an
EXCEL cell and get the answer.
Answer: 53.28% of the students scored in the range between 450 and 600.
Answer to Question 3
In this case, we know the probability (or the area) and we must find the corresponding score
(or the X-value). Specifically, we must find X such that the area under the curve to the right
of X is 10% or, equivalently, the area to the left of X is 90%.
NORM.INV(probability, mean, standard-dev) gives the X value for which the area under
the normal curve to the LEFT is equal to the value you input for probability. Here we enter
“=NORM.INV(0.90, 500, 100)”.
Answer: The cutoff score at this college is 628.
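The three SAT answers can be reproduced with Python's standard library as a cross-check on the EXCEL results. This is a minimal sketch: `NormalDist.cdf` plays the role of NORM.DIST with the cumulative argument set to TRUE, and `NormalDist.inv_cdf` plays the role of NORM.INV; the variable names are hypothetical.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)   # Normal model for the SAT scores

# Question 1: area to the LEFT of 600.
below_600 = sat.cdf(600)

# Question 2: area between 450 and 600, as a difference of two left areas.
between_450_and_600 = sat.cdf(600) - sat.cdf(450)

# Question 3: score with 90% of the area to its LEFT (top 10% cutoff).
top_10_cutoff = sat.inv_cdf(0.90)

print(f"Proportion scoring below 600:       {below_600:.4f}")
print(f"Proportion scoring 450 to 600:      {between_450_and_600:.4f}")
print(f"Cutoff score for the top 10%:       {top_10_cutoff:.0f}")
```

The same three calls, with a different mean and standard deviation, cover most of the normal-distribution questions in the practice problems.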
Appendix 2: Practice Problems
1. Filling Tide Detergent Bottles
Procter and Gamble manufactures liquid Tide detergent (among many other
products). Liquid Tide is sold in plastic bottles. One of the final steps in the
manufacturing process is to fill the bottles of Tide. One machine used to fill
the bottles is set to put an average of 100 ounces of Tide in each bottle.
However, this machine cannot be guaranteed to put exactly 100 ounces of
Tide in each bottle. Rather, the fill amount is known to follow a normal
distribution with mean of 100 ounces and standard deviation of 0.2 ounces.
Thus, some bottles will contain slightly more than 100 ounces and some
slightly less, even though these bottles will be labeled as “100 ounce” bottles.
a. What is the probability that less than 99.6 ounces will be put into a “100
ounce” bottle of liquid Tide?
b. Calculate the probability that a single bottle of Tide will contain
between 99.9 and 100.1 ounces.
c. What is the 90th percentile of the fill amounts?
d. Suppose P&G can adjust the mean fill amount on the machine that fills
the Tide bottles. At what value should the mean fill be set in order to
ensure that only 5% of the Tide bottles will contain less than 99.8
ounces?
e. We plan to examine a random sample of 100 Tide bottles to assess the
operating efficiency of the machine. Calculate the probability that the
average fill for a random sample of 100 Tide bottles is between 99.9 and
100.1 ounces.
2. Stereo Component Warranty
A company that produces an expensive stereo component is considering
offering a warranty on the component. Suppose the population of lifetimes of
the components is a normal distribution with a mean of 84 months and a
standard deviation of 7 months. If the company wants no more than 2% of the
components to wear out before they reach the warranty date, what number of
months should be used for the warranty? (Answer: about 69.6, or roughly 70 months)
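The warranty question is an inverse-probability problem: we know the area (2% to the left) and need the corresponding lifetime. A minimal Python sketch of that calculation follows; it mirrors what NORM.INV(0.02, 84, 7) would do in EXCEL, and the variable names are hypothetical. An answer worked from a table with z rounded to two decimals may differ slightly in the second decimal place.

```python
from statistics import NormalDist

lifetimes = NormalDist(mu=84, sigma=7)   # component lifetimes in months

# Lifetime below which only 2% of the components fall.
warranty_months = lifetimes.inv_cdf(0.02)

# Check: the proportion failing before that date should be 2%.
failing_before_warranty = lifetimes.cdf(warranty_months)

print(f"Warranty length: {warranty_months:.2f} months")
print(f"Proportion wearing out earlier: {failing_before_warranty:.4f}")
```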
3. Textbook
A large required chemistry course at a state university has been using the
same textbook for a number of years. Over the years, the students have been
asked to rate this text on a 10-point scale, and the average rating has been
stable at about 5.2. This year the faculty decided to try a new text. After the
course, 35 randomly selected students were asked to rate this new text. The
results are shown below:
6
3
6
7
6
10
6
8
7
10
3
6
5
7
8
10
6
7
6
4
6
6
4
6
8
7
7
9
10
9
5
8
6
8
7
The sample mean of the 35 sample values is 6.77. The sample standard
deviation is 1.85. Do the data provide evidence that the average rating for the
new book is different from that of the old book (5.2)?
Hypotheses:
H0:
Ha:
Decision Rule:
Test Statistic:
Decision:
Conclusion:
4. Battery Lifetimes
DC Company makes batteries for cell phones. Recently, the R&D department
at the company came up with a new battery design that they believed would
last longer than batteries currently on the market. However, senior managers
were concerned about making the claim that the new battery lifetime was
greater, on average, than the current industry standard of 30 hours. They
believed there would be serious bad publicity and sales would decline if trade
publication tests showed otherwise. Before the new battery is put into
production, the company planned to test a random sample of 100 batteries.
The battery lifetime data is shown below:
42.2
30.9
27.4
35
26.8
33.2
38.5
31.2
29.4
25.2
34
26.1
28.4
31.3
39.1
32.8
24.6
30.2
19.6
37.9
29.9
42.8
30
30
32.8
34.6
37.8
26.4
32.2
32.9
26
39.8
28.7
30.5
26.6
31.8
31.1
34.3
22.3
29.6
29.6
30.3
35.3
34.3
32
29.5
27.5
21.7
21.6
35
19.7
38.3
32.1
26.3
30.7
30.7
29.6
26.8
25.1
33.3
26.5
29.5
31.6
31.3
38.8
31.4
28.9
26.6
33.9
25.1
35.7
37.2
28.9
32.2
30.1
32.3
32.7
28.2
26.2
30.4
29.7
29.1
44.1
28.2
30.5
30.8
29.4
33.7
29.2
25.6
29.4
28.1
26.1
26.4
30.3
30.9
21.3
37.3
24.8
38.2
Is there sufficient evidence to conclude that the population of new batteries
will have average lifetime greater than 30 hours? Set up the hypotheses and
conduct the test using α = .05. The sample standard deviation is 4.816 hours.
The sample mean of the 100 lifetimes is 30.639 hours.
Hypotheses:
H0:
Ha:
Decision Rule:
Test Statistic:
Decision:
Conclusion: