Download Statistics 51-651-02

Document related concepts

Taylor's law wikipedia , lookup

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Statistical inference wikipedia , lookup

Resampling (statistics) wikipedia , lookup

German tank problem wikipedia , lookup

Student's t-test wikipedia , lookup

Opinion poll wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
MBA
Statistics 51-651-00
http://www.hec.ca/sites/cours/51-651-02/
What is statistics?
"I like to think of statistics as the science
of learning from data... Statistics is
essential for the proper running of
government, central to decision making
in industry, and a core component of
modern educational curricula at all
levels."
Jon Kettenring
ASA President, 1997
2
What is statistics?

American Heritage Dictionary® defines
statistics as:
"The mathematics of the collection,
organization, and interpretation of numerical
data, especially the analysis of population
characteristics by inference from sampling.«

The Merriam-Webster’s Collegiate
Dictionary® definition is:
"A branch of mathematics dealing with
the collection, analysis, interpretation, and
presentation of masses of numerical data."
3
Course syllabus
Variation. Sampling and estimation.
 Decision making from statistical
inference.
 Qualitative data analysis.
 Simple and multiple linear regression.
 Forecasting.
 Statistical process control.
 Revision.

4
EVALUATION
 Teamwork:
40%
 Final
60%
exam:
5
COURSE # 1
Variation, sampling
and estimation.
Variation
"The central problem in management
and in leadership ... is failure to
understand the information in
variation"
W. Edwards Deming
7
Variation
"Management takes a major step
forward when they stop asking you to
explain random variation"
F. Timothy Fuller
8
Variation
"Failure to understand variation is a
central problem of management"
Lloyd S Nelson
9
Airport Immigration
Officer
Alan
Barbara
Colin
Dave
Enid
Frank
Passengers
processed
9
10
4
8
6
14
10
Airport Immigration
Management expected their officers to
process 10 passengers during this period
 The immigration services manager, in
reviewing these figures, was

 concerned
about the performance of
Colin
 thinking how best to reward Frank
11
Debt Recovery



When the amount of recovered debt is
much lower than the target recovery level
of 80%, the General Manager visits all the
District Offices in New Zealand to remind
managers of the importance of customers
paying on time
What do you think of the GM policy?
What would you do?
12
Budget Deviations
Budget deviations measure the
difference between the amount
budgeted and the actual amount,
expressed as a percentage of the
budgeted amount.
 The aim is to have a zero deviation.
 Most of the variation lies between
-3% and 4%.

13
Illustration of variation
Excel program:beads.xls (Deming)
The red balls are associated with
defective products.
Five times a day, 5 technicians select a
sample of 50 beads and counts the
number of defectives (red).
Beads History – 17 July 2000
Number of red beads
20
15
10
5
0
15
Beads History - 9 March 2000
Number of red beads
20
15
10
5
0
16
Beads History - 8 March 2001
Number of red beads
20
15
10
5
0
17
Beads History - 5 March 1999
Number of red beads
20
15
10
5
0
18
Beads History - 19 July 1996
Number of red beads
20
15
10
5
0
19
Beads History - 8 March 1996
Number of red beads
20
15
10
5
0
20
Beads History - 10 March 1995
Number of red beads
20
15
10
5
0
21
Beads History - 6 March 1998
Number of red beads
20
15
10
5
0
22
Beads History: 27 Experiments
Number of red beads
25
20
15
10
5
0
23
Beads Averages
Number of red beads
15
10
5
0
1
3
5
7
9
11
13
15
17
19
21
23
25
27
24
Discussion

What is the main difference between
the graph of the 25x27=675 draws
and the graph of the 27 averages?
25
Two approaches in
management

Fire-fighting

Scientific
26
Fire-fighting approach
Problem
Solution
27
Scientific Approach
Problem
Solution
Cause
28
Scientific Approach
Making decisions based on data
rather than hunches.
 Looking for the root causes of
problems rather than reacting to
superficial symptoms.
 Seeking permanent solutions rather
than quick fixes.

29
The Need for Data
To
 To
 To
 To
 To

understand the process
determine priorities
establish relationships
monitor the process
eliminate causes of variation
30
The steps of statistical
analysis involve:
1.
2.
3.
4.
Planning the collection of
information
Collecting information
Evaluating information
Drawing conclusions
31
Surveys:
Collect information from a carefully
specified sample and extend the
results to an entire population.
 Sample surveys might be used to:

Determine which political candidate is
more popular
 Discover what foods teenagers prefer
for breakfast
 Estimate the number of potential clients

32
Sampling Definitions
Population
Parameter
choose
estimate
Sample
Statistic
calculate
33
Government Operations:
Conduct experiments to aid in the
development of public policy and
social programs.
 Such experiments include:

consumer prices;
 fluctuations in the economy;
 employment patterns;
 population trends.

34
Scientific Research:

Statistical sciences are used to enhance
the validity of inferences in:





radiocarbon dating to estimate the risk of
earthquakes;
clinical trials to investigate the
effectiveness of new treatments;
field experiments to evaluate irrigation
methods;
measurements of water quality;
psychological tests to study how we reach
the everyday decisions in our lives.
35
Business and Industry:
predict the demand for products and
services;
 check the quality of items
manufactured in a facility;
 manage investment portfolios;
 forecast how much risk activities
entail, and calculate fair and
competitive insurance rates.

36
Sampling
 Our knowledge, our attitudes and
our actions are mainly based on
samples.
 For example, a person’s opinion of
an institution or a company which
makes thousands of transactions
every day is often determined by
only one or two meetings with this
institution.
37
Census vs Sample
Census = reality (True or false?!)
 The information needed is available
for all individuals of the study
population.
 Sample = estimation of the reality
 The information needed is only
available for a subset of the
individuals of the study population.

38
Advantages of a sample
Reduced costs
 Accrued speed
 Offers more possibilities

in some cases it may be impossible to
have a census (ex: quality control)

Perhaps more precise!
Cases where highly qualified personal
are necessary for collecting data
39
Probabilistic vs non
probabilistic samples
Sampling methods
Probabilist
simple random
systematic
stratified
Non probabilist
cluster(area)
on a voluntary basis
purposive
40
Sampling errors

Random error
•

different samples will produce different
estimates of the study population
characteristics
Systematic error - bias
•
•
•
non probabilistic sample
probabilistic sample with a high rate of
non respondents
biased instrument of measure
41
TV Show Poll - March 1998

Should Hamilton be renamed
Waikato City?
4400 dialled the 0900 number
 73% were against the change

What type of sample was taken?
 What conclusions would you draw?

42
Bias vs variability
Bias is a systematic error, in the
same direction, of successive
estimations of a parameter.
 Large variability means that
repeated values of estimations are
scattered; the results of successive
sampling cannot be reproduced.
 (see …)

43
a) large bias, low variability
b) low bias, high variability
c) large bias, high variability
d) low bias, low variability
44
Bias due to non-response




Bias is often caused by non-response in
surveys.
For example, suppose that the population is
divided in two groups : respondents (60%)
and non-respondents (40%).
Within respondents, 65% are in favour of a
project et within non-respondents, 20% are in
favour.
The real proportion in the population in favour
of the project is p = 47% , while a survey will
give an estimation of p at about 65%  47%.
The bias is 18%.
45
How do we make a simple
random sample drawing?
1.
2.
3.
We need a list. Each element of the
population is assigned a number from 1
to N.
We use a computer program to select n
numbers as randomly as possible (ex:
Excel, MINITAB, SAS, C).
The corresponding elements form the
sample.
46
Notes :




The results obtained depend on the sample
taken.
If the samples are taken according to codes of
practice, the results should all be similar.
For a simple random draw, each individual of
the population is as likely to be selected at
each draw.
For a simple random draw, there are many
different possible samples. All possible
samples of the same size have the same
chance of being selected.
48
Opinion polls
The results obtained in a
probabilistic sample will be used to
generalize the entire population.
 But the fact of using a sample
necessarily induces a margin of
error that we will try to control.
 We will distinguish two types of
data: qualitative and quantitative.

49
Types of data

Qualitative
(measurement scale: nominal or ordinal)
(parameter: %)

Examples:
•
•
•
•

sex (F, M)
political party (PLQ, PQ, ADQ)
preferred brand (Coke, Pepsi, Homemade brand, …)
satisfaction level (Likert scale from 1 to 5)
Quantitative

(measurement scale: interval or ratio)
(parameter: mean)
Examples:
• age
• income
• temperature (in degrees Celsius)
50
Case study




Data in credit.xls represent the credit balance
and the total income of 100 randomly chosen
families in Quebec.
What is the mean credit balance for a family in
Quebec? What is the precision (margin of error)
of your estimate?
What about a Canadian family?
Assuming that 2 500 000 families use at least
one credit card regularly, what is the total debt
of families in Quebec? What is the precision of
the estimate?
51
Confidence intervals

To estimate the proportion p of
individuals with a given characteristic
among the population,

or to estimate the mean  of a given
quantitative variable,

one uses a confidence interval at the
(1- ) level.
52
Confidence intervals
(continued)

It consists of constructing an interval of
values which enables one to affirm,
with a certain level of confidence (in
general: 90%, 95% or 99%), that the
true value of the parameter for the
population, is included in this interval.

Illustration: Confidence interval applet
53
Confidence interval for
estimating a proportion p


Example: In a sample of 125 college
students who were questioned on their
intentions to vote in the next election, 45
answered positively.
Estimate, in a specific way, the proportion
of the entire student population of this
institution who intend to vote at the next
elections.
54
Confidence interval for estimating a
proportion (continued)
Excel program: proportion1.xls
In general, if the sample size n is large enough, the (1 - )
confidence interval to estimate the true proportion p of the studied
characteristic in the population is given by:
z/2 , read in the normal distributi on table, is such as,
P[ - z/2  Z  z/2 ]  1 -  ;
if   5%, then z/2  1.96 .
55
Example (continued) :

Consequently, a confidence interval of 95%
certainty for the proportion of the entire
student population of this institution who
intend to vote at the next election is given
by:
56
Example (continued) :
How would we report the results of
this survey in the student newspaper
of this college?
 36% of the students of this college
intend to exercise their voting rights
at the next student election. The
margin of error is 8.4% with a 95%
degree of confidence (or with 95%
certainty or 19 times out of 20).

57
Notes:
This
is an approximate formula and applies
only for large samples.
If
all possible random samples of size n are
taken and their 95% confidence interval
calculated, 95% of them will include the
true proportion p of the population, and
thus 5% will not include it.
The
quantity
is called the
margin of error and is used to establish the
95% confidence level (19 times out of 20).
58
Margin of error at the 95% level
sample size n and the value of p at the 95% level
p (%)
10
20
30
40
50
60
70
80
90
100
5,9
7,8
9,0
9,6
9,8
9,6
9,0
7,8
5,9
300
3,4
4,5
5,2
5,5
5,7
5,5
5,2
4,5
3,4
size of the sample n
500
1000
2,6
1,9
3,5
2,5
4,0
2,8
4,3
3,0
4,4
3,1
4,3
3,0
4,0
2,8
3,5
2,5
2,6
1,9
3000
1,1
1,4
1,6
1,8
1,8
1,8
1,6
1,4
1,1
10000
0,6
0,8
0,9
1,0
1,0
1,0
0,9
0,8
0,6
59
Margin of error at the 90% level
Margin of error or precision (in %) according to the sample
size n and the value of p at the 90% level
p (%)
10
20
30
40
50
60
70
80
90
100
4,9
6,6
7,5
8,1
8,2
8,1
7,5
6,6
4,9
300
2,8
3,8
4,4
4,7
4,7
4,7
4,4
3,8
2,8
sample size n
500
1000
2,2
1,6
2,9
2,1
3,4
2,4
3,6
2,5
3,7
2,6
3,6
2,5
3,4
2,4
2,9
2,1
2,2
1,6
3000
0,9
1,2
1,4
1,5
1,5
1,5
1,4
1,2
0,9
10000
0,5
0,7
0,8
0,8
0,8
0,8
0,8
0,7
0,5
60
Calculation of size n to ensure a
maximum margin of error

If we want to estimate the
proportion p at a (1-) confidence
level with a maximum margin of
error e, then we have the following
relation for the calculation of the
sample size n :
61
Discussion

Take a look at the following survey:
Survey on California recall election

In the light of the recent results,
what can you say about the survey.
62
Confidence interval for
estimating the mean 
In a general way, if the sample size n is large enough,
the (1 - ) confidence interval for estimating the true
mean  of the population, is given by:
t /2, n -1 , read in the Student distributi on table , and such as,
P[ - t /2, n -1  T  t /2, n -1 ]  1 -  .
63
Notes:

This is an approximate formula that applies
for small samples when the characteristic is
has a normal distribution or from a large
sample (n ≥ 30).

When n is very large (n ≥ 100), the values of
t/2, n-1 and z/2 coincide.

Excel: Tools/Data Analysis/Descriptivre
Statistics or mean-1.xls
64
Notes: (continued)
Interpretation of a 95% confidence
interval for the mean  of a
characteristic in the population:
If all the random samples of size n
are taken and their confidence
intervals calculated, 95% of them will
include the true mean  of the
population, and thus 5% will not
include it.
Recall the confidence interval applet.
65
Confidence interval for :
Example
In order to know the weekly average
cost of the grocery basket for a family
of 4 people residing in Sherbrooke, we
take a sample of 50 of these families
and we note the amount of their
grocery for this week. We obtain an
average amount of 155$ with a
standard deviation estimate of 15$.
66
Example (continued) :
Estimate the current average cost of the grocery
basket for a family of 4 people residing in
Sherbrooke using a 95% confidence interval:
By stating that the current average cost of the
grocery basket of a family of 4 people residing at
Sherbrooke is included in the interval [150.74$;
159.26$], I am 95% certain to be right. My
prediction will be true 95% of time.
67
Example

A company wants to commercialize a new software
to get rid of junk mail. The potential market is 800
000 consumers.


Before starting selling the product, the company
realized a survey from a random sample of 40
families. Six families were interested in buying the
new software.


The net gain is 3$ per software and there are fixed
costs of 50 000$.


What is the decision?
68