Business Statistics for Managerial Decision
Farideh Dehkordi-Vakil
Comparing Two Proportions

• We often want to compare the proportions of two groups (such as men and women) that share some characteristic.
• We call the two groups being compared population 1 and population 2.
• The two population proportions of “successes” are $P_1$ and $P_2$.
• The data consist of two independent simple random samples (SRSs).
• The sample sizes are $n_1$ from population 1 and $n_2$ from population 2.
Comparing Two Proportions

• The proportion of successes in each sample estimates the corresponding population proportion.
• Here is the notation we will use:

Population   Population proportion   Sample size   Count of successes   Sample proportion
1            $P_1$                   $n_1$         $X_1$                $\hat{p}_1 = X_1 / n_1$
2            $P_2$                   $n_2$         $X_2$                $\hat{p}_2 = X_2 / n_2$
Sampling Distribution of $\hat{p}_1 - \hat{p}_2$

• Choose independent SRSs of sizes $n_1$ and $n_2$ from two populations with proportions $P_1$ and $P_2$ of successes.
• Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions of successes.
• Then as both sample sizes increase, the sampling distribution of $D$ becomes approximately Normal.
• The mean of the sampling distribution is $P_1 - P_2$.
• The standard deviation of the sampling distribution is

$$\sigma_D = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$$
Sampling Distribution of $\hat{p}_1 - \hat{p}_2$

• The sampling distribution of the difference of two sample proportions is approximately Normal.
• The mean and standard deviation are found from the two population proportions of successes, $P_1$ and $P_2$.
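As a rough illustration of the mean and standard deviation formulas for the difference of two sample proportions, the following Python sketch evaluates them directly. The function name and the input values are hypothetical, chosen only for illustration; they do not come from the slides.

```python
import math


def diff_sampling_distribution(p1, n1, p2, n2):
    """Return (mean, standard deviation) of D = p1_hat - p2_hat.

    p1, p2 are the population proportions of successes;
    n1, n2 are the sizes of the two independent samples.
    """
    mean = p1 - p2
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return mean, sd


# Hypothetical population proportions and sample sizes (illustration only).
mean_d, sd_d = diff_sampling_distribution(p1=0.5, n1=100, p2=0.5, n2=100)
print(f"mean of D = {mean_d:.3f}, standard deviation of D = {sd_d:.4f}")
```

Increasing either sample size shrinks the corresponding term under the square root, so the distribution of the difference becomes more concentrated around $P_1 - P_2$.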
Confidence Interval

• Just as in the case of estimating a single proportion, a small modification of the sample proportions greatly improves the accuracy of confidence intervals.
• The Wilson estimates of the two population proportions are

$$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2}, \qquad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}$$
Confidence Interval

• The standard deviation of $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ is approximately

$$\sigma_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}}$$

• To obtain a confidence interval for $P_1 - P_2$, we replace the unknown parameters in the standard deviation by estimates to obtain an estimated standard deviation, or standard error.
Confidence Interval for Comparing Two Proportions
Example: “No Sweat” Garment Labels

Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a “No Sweat” label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.
Example: “No Sweat” Garment Labels

For some conditions, it was stated that the garment had a “No Sweat” label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a “label user” or a “label nonuser.” About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.
Example: “No Sweat” Garment Labels

The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let $X$ denote the number of label users.

Population   $n$   $X$   $\hat{p} = X/n$   $\tilde{p} = (X + 1)/(n + 2)$
1 (women)    296   63    0.213             0.215
2 (men)      251   27    0.108             0.111
Example: “No Sweat” Garment Labels

First calculate the standard error of the observed difference.

$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}} = \sqrt{\frac{(0.215)(0.785)}{296 + 2} + \frac{(0.111)(0.889)}{251 + 2}} = 0.0308$$

The 95% confidence interval is

$$(\tilde{p}_1 - \tilde{p}_2) \pm z^* SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0308) = 0.104 \pm 0.060 = (0.04, 0.16)$$
Example: “No Sweat” Garment Labels

• With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16.
• Alternatively, we can report that the proportion of label users is about 10 percentage points higher among women than among men, with a 95% margin of error of 6 percentage points.
• In this example we chose women to be the first population. Had we chosen men as the first population, the estimate of the difference would be negative (-0.104).
• Because it is easier to discuss positive numbers, we generally choose the first population to be the one with the higher proportion.
• The choice does not affect the substance of the analysis.
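The confidence interval in this example can be reproduced with a few lines of Python. This is a minimal sketch, not a library routine: the function name `wilson_two_proportion_ci` is made up for illustration, and the counts (63 of 296 women, 27 of 251 men) are taken from the data summary in the example. Because the code carries full precision rather than three-decimal rounded values, its results may differ from the slide figures in the last digit.

```python
import math


def wilson_two_proportion_ci(x1, n1, x2, n2, z_star=1.96):
    """Confidence interval for P1 - P2 based on the Wilson estimates.

    Each Wilson estimate adds one success and two observations to a sample.
    z_star = 1.96 corresponds to 95% confidence.
    """
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    # Standard error of the difference of the two Wilson estimates.
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    margin = z_star * se
    return diff, se, (diff - margin, diff + margin)


# Women are population 1, men are population 2, as in the example.
diff, se, (low, high) = wilson_two_proportion_ci(63, 296, 27, 251)
print(f"difference = {diff:.3f}, SE = {se:.4f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Swapping the two samples in the call flips the sign of the estimated difference and of both interval endpoints, which matches the remark that the choice of first population does not affect the substance of the analysis.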
Significance Tests

• It is sometimes useful to test the null hypothesis that the two population proportions are the same.
• We standardize $D = \hat{p}_1 - \hat{p}_2$ by subtracting its mean $P_1 - P_2$ and then dividing by its standard deviation

$$\sigma_D = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$$

• If $n_1$ and $n_2$ are large, the standardized difference is approximately $N(0, 1)$.
• To estimate $\sigma_D$ we take into account the null hypothesis that $P_1 = P_2$.
Significance Tests

• If these two proportions are equal, we can view all of the data as coming from a single population.
• Let $P$ denote the common value of $P_1$ and $P_2$. The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is then

$$\sigma_{Dp} = \sqrt{\frac{P(1 - P)}{n_1} + \frac{P(1 - P)}{n_2}} = \sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Significance Tests

• We estimate the common value of $P$ by the overall proportion of successes in the two samples:

$$\hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2}$$

• This estimate of $P$ is called the pooled estimate.
• To estimate the standard deviation of $D$, substitute $\hat{p}$ for $P$ in the expression for $\sigma_{Dp}$.
• The result is a standard error for $D$ under the condition that the null hypothesis $H_0: P_1 = P_2$ is true.
• The test statistic uses this standard error to standardize the difference between the two sample proportions.
Significance Tests for Comparing Two Proportions
Example: men, women, and garment labels

The previous example presented the survey data on whether consumers are “label users” who pay attention to label details when buying a shirt. Are men and women equally likely to be label users? Here is the data summary:

Population   $n$   $X$   $\hat{p} = X/n$
1 (women)    296   63    0.213
2 (men)      251   27    0.108
Example: men, women, and garment labels

We compare the proportions of label users in the two populations (women and men) by testing the hypotheses

$$H_0: P_1 = P_2$$
$$H_a: P_1 \neq P_2$$

The pooled estimate of the common value of $P$ is:

$$\hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645$$

This is the proportion of label users in the entire sample.
Example: men, women, and garment labels

The test statistic is calculated as follows:

$$SE_{Dp} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{Dp}} = \frac{0.213 - 0.108}{0.03181} = 3.30$$

The observed difference is more than 3 standard deviations away from zero.
Example: men, women, and garment labels

The P-value is:

$$2 \times P(Z \geq 3.30) = 2 \times (1 - 0.9995) = 2 \times 0.0005 = 0.001$$

Conclusion:

21% of women are label users versus only 11% of men; the difference is statistically significant.
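The pooled two-sided z test worked through in this example can be sketched in Python using only the standard library. The function name `two_proportion_z_test` is hypothetical. The Normal tail probability is computed from `math.erfc` rather than read from a table, so the P-value may differ slightly from the table-based figure.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: P1 = P2 using the pooled estimate."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled estimate: overall proportion of successes in both samples.
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error of the difference under H0.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # For a standard Normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled, se, z, p_value


# Label users: 63 of 296 women (population 1), 27 of 251 men (population 2).
pooled, se, z, p_value = two_proportion_z_test(63, 296, 27, 251)
print(f"pooled = {pooled:.4f}, SE = {se:.5f}, z = {z:.2f}, P-value = {p_value:.4f}")
```

Note that the pooled standard error is appropriate only for the test of equal proportions; a confidence interval for the difference uses the unpooled standard error instead.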
Simple Regression

• Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x).
• The dependent variable is the variable for which we want to make a prediction.
• While various non-linear forms may be used, simple linear regression models are the most common.
Introduction
• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.

Lot size   Man-hours
30         73
20         50
60         128
80         170
40         87
50         108
60         135
30         69
70         148
60         132
Introduction
The goal of the analyst who studies the data is to find a functional relation $y = f(x)$ between the response variable $y$ and the predictor variable $x$.

[Figure: scatter plot titled “Statistical relation between Lot size and Man-Hour”, with Lot size on the horizontal axis and Man-Hour on the vertical axis.]
Regression Function
The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(x) to be the expected value (i.e., mean value) of Y.
3. Given that E(Y) denotes the expected value of Y, call the equation

$$E(Y) = f(x)$$

the regression function.
Historical Origin of Regression


Regression Analysis was
first developed by Sir
Francis Galton, who
studied the relation
between heights of sons
and fathers.
Heights of sons of both
tall and short fathers
appeared to “revert” or
“regress” to the mean of
the group.
Basic Assumptions of a Regression Model

A regression model is based on the following assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that $\mu_y$ is the mean value of Y, the standard form of the model is

$$Y = f(x) + \varepsilon$$

where $\varepsilon$ is a random variable with a normal distribution.
Statistical relation between Lot Size and number of Man-Hours: Westwood Company Example

[Figure: scatter plot titled “Statistical relation between Lot size and number of Man-Hours”.]
Pictorial Presentation of Linear Regression Model
Construction of Regression Models



Selection of independent variables
Functional form of regression relation
Scope of model
Uses of Regression Analysis

• Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction
• The several purposes of regression analysis frequently overlap in practice.
Formal Statement of the Model

General regression model

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

1. $\beta_0$ and $\beta_1$ are parameters.
2. $X$ is a known constant.
3. Deviations $\varepsilon$ are independent $N(0, \sigma^2)$.
Meaning of Regression Coefficients

• The values of the regression parameters $\beta_0$ and $\beta_1$ are not known. We estimate them from data.
• $\beta_1$ indicates the change in the mean response per unit increase in $X$.
Regression Line

• If the scatter plot of our sample data suggests a linear relationship between two variables, i.e.

$$y = \beta_0 + \beta_1 x$$

we can summarize the relationship by drawing a straight line on the plot.
• The least squares method gives us the “best” estimated line for our set of sample data.
Regression Line

• We will write an estimated regression line based on sample data as

$$\hat{y} = b_0 + b_1 x$$

• The method of least squares chooses the values for $b_0$ and $b_1$ to minimize the sum of squared errors

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
Regression Line

Using calculus, we obtain estimating formulas:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
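A minimal Python sketch of these estimating formulas follows. The function names are hypothetical, and the data are the ten lot size and man-hours pairs listed on the Introduction slide.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between observed and fitted values."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Lot size (x) and man-hours (y) from the Introduction slide.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

b0, b1 = least_squares_line(lot_size, man_hours)
sse = sum_squared_errors(lot_size, man_hours, b0, b1)
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, SSE = {sse:.1f}")
```

For these data the fit works out to $b_0 = 10$ and $b_1 = 2$, with SSE = 60. Any other choice of intercept and slope gives a larger sum of squared errors, which is what "best" means in the least squares sense.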
Estimation of Mean Response

• The fitted regression line can be used to estimate the mean value of y for a given value of x.
• Example: the weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

y      x
1250   41
1380   54
1425   63
1425   54
1450   48
1300   46
1400   62
1510   61
1575   64
1650   71
Point Estimation of Mean Response

From the previous table we have:

$$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$$

The least squares estimates of the regression coefficients are:

$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = 10.8$$

$$b_0 = 1436.5 - 10.8(56.4) = 828$$
Point Estimation of Mean Response

The estimated regression function is:

$$\hat{y} = 828 + 10.8x$$

$$\text{Sales} = 828 + 10.8 \times \text{Expenditure}$$

This means that if the weekly advertising expenditure is increased by $1, we would expect the mean weekly sales to increase by $10.80.
Point Estimation of Mean Response

• Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.
• For example, if the advertising expenditure is $50, then the estimated sales are:

$$\text{Sales} = 828 + 10.8(50) = 1368$$

• This is called the point estimate of the mean response (sales).
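The coefficient and point-estimate arithmetic for the advertising example can be checked with a short Python sketch. It starts from the summary sums given in the example; the helper name `predict_sales` is hypothetical. Since the code keeps full precision instead of rounding the slope to 10.8 first, its results agree with the slide values only after rounding.

```python
# Summary sums from the advertising example.
n = 10
sum_x, sum_x2 = 564, 32604
sum_y, sum_xy = 14365, 818755

# Computational form of the least squares slope and intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)


def predict_sales(expenditure, intercept=b0, slope=b1):
    """Point estimate of mean weekly sales at a given advertising expenditure."""
    return intercept + slope * expenditure


print(f"b1 = {b1:.1f}, b0 = {b0:.0f}")
print(f"estimated mean sales at x = 50: {predict_sales(50):.0f}")
```

A point estimate like this is most trustworthy for expenditures inside the observed range of x (41 to 71 in this table); the fitted line says little about values far outside it.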
Residual

• The residual is the difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$:

$$e_i = y_i - \hat{y}_i$$

• Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
Example: weekly advertising expenditure

y      x    y-hat    Residual (e)
1250   41   1270.8   -20.8
1380   54   1411.2   -31.2
1425   63   1508.4   -83.4
1425   54   1411.2   13.8
1450   48   1346.4   103.6
1300   46   1324.8   -24.8
1400   62   1497.6   -97.6
1510   61   1486.8   23.2
1575   64   1519.2   55.8
1650   71   1594.8   55.2
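The fitted values and residuals in this table can be regenerated with a few lines of Python. This sketch uses the rounded coefficients from the estimated regression function (828 and 10.8), which is an assumption about how the table was produced; it is consistent with the tabulated values.

```python
# Weekly sales (y) and advertising expenditure (x) from the example table.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
expenditure = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

# Rounded coefficients of the estimated regression function.
b0, b1 = 828, 10.8

# Fitted value: y_hat = b0 + b1 * x. Residual: e = y - y_hat.
fitted = [b0 + b1 * x for x in expenditure]
residuals = [y - y_hat for y, y_hat in zip(sales, fitted)]

for y, x, y_hat, e in zip(sales, expenditure, fitted, residuals):
    print(f"{y:6d} {x:4d} {y_hat:8.1f} {e:8.1f}")

print(f"sum of residuals = {sum(residuals):.1f}")
```

With exact least squares coefficients the residuals sum to zero. Here the sum is small but not zero because the coefficients were rounded before the fitted values were computed.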