CHAPTER 16
BIVARIATE STATISTICS:
PARAMETRIC TESTS
Marketing Research, 2nd Edition
Alan T. Shao
Copyright © 2002 by South-Western
What The Experts Say
What’s the point in doing surveys if you can’t analyze
the data? Converting and reducing data into
meaningful results is a marketing researcher’s key
responsibility.
--SPSS Web Page, “Analysis,”
http://www.spss.com/spssmr/solutions/
analysis.htm, February 19, 2001.
Learning Objectives
• Discuss the importance of parametric statistics
• Describe the difference between tests of differences and
tests of associations
• Explain how to use z- and t-tests to compare two groups
• Describe and calculate the F-test
• Discuss the meaning and use of analysis of variance
• Describe correlation and regression analyses
• Calculate and interpret correlation and regression statistics
• Compute one-way analysis of variance manually and by
computer
Get This! My Name Is Important to Me
• In the book, How to Win Friends and Influence People, Dale
Carnegie wrote, “Remember that a person’s name is to that
person the sweetest and most important sound in any language.”
• A professor classified students into three groups: names (those
he could remember), no-names (those he could not remember),
and neutral-names (those whose names he never made reference
to during the conversations).
• At the end of a meeting with each student, the professor would
state “Oh, I have to ask you something else. My wife is selling
cookies for the church. If you want any, they’re only 25 cents.”
This offer was made to examine if remembrance of a student’s
name made a difference regarding whether or not he or she
would comply with a request (that is, purchase the cookies).
Get This! My Name Is Important to Me –
cont’d
• The results were analyzed using several different
statistical techniques, one being analysis of variance.
• He found:
– Not being able to remember a student’s name produced
compliance results (that is, purchasing the cookies) no
different from those of a condition in which the issue of
a student’s name was never raised.
– The higher purchasing rate for those students whose
names were remembered indicates that name
remembrance facilitates compliance.
Now Ask Yourself
• Based on your knowledge of statistics, do you have faith in the
findings since various statistical tools were used to analyze the
data? Was it really necessary for the researchers to run statistical
tests to generate their findings?
• What was meant by, “The professor decided to use this method
[analysis of variance] since it tests whether there are statistically
significant differences among the means of each of the student
groups”?
• Were the results surprising to you? If so, what did you expect?
If not, why not?
Parametric Tests
The hypothesis tests assume that variables under
investigation are measured using either interval or ratio
scales. Furthermore, it is necessary to make some
additional assumptions.
• The sample data should be randomly drawn from a
normally distributed population.
• The sample data drawn must be independent of each
other.
• When examining central tendency for which two or
more samples are drawn, the populations should have
equal variances.
Tests of Difference
Can be used whenever a researcher is interested in
comparing some characteristic of one group with a
characteristic of another and determining whether or not a
significant difference exists between the two groups.
• The first population and its samples are identified by subscript 1.
• The second population and its samples are identified by subscript 2.
• $\bar{X}_1$ represents the mean of the sample drawn from population 1.
• $\bar{X}_2$ represents the mean of the sample drawn from population 2.
Z-test: Difference Between Means
Used to determine whether two population means differ from
each other. This can be determined by using either the z-test or
t-test, depending on the sample size and whether or not the
population standard deviation is known for either group.
If the sample size is at least 30 and the population standard
deviations are known, the z-test should be used.

$$z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{(\bar{x}_1 - \bar{x}_2)}}$$
[Formula 16-1]

where

$$\sigma_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

$(\bar{X}_1 - \bar{X}_2)$ = the difference between sample means
$(\mu_1 - \mu_2)$ = the difference between population means
$\bar{X}_1$ and $\bar{X}_2$ = sample means for the two variables
$\sigma_{(\bar{x}_1 - \bar{x}_2)}$ = standard error of the difference between the means
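The z-test for two means can be sketched in a few lines of Python. This is only an illustration: the function name `two_sample_z`, its parameters, and the sample figures are hypothetical and do not come from the chapter, and the two-tailed p-value step is an added convenience based on the standard normal distribution.

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, hypothesized_diff=0.0):
    """z statistic for the difference between two means (known sigmas)."""
    # Standard error of the difference between the means
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((mean1 - mean2) - hypothesized_diff) / se
    # Two-tailed p-value from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical figures, chosen only to illustrate the calculation
z, p = two_sample_z(mean1=52.0, mean2=50.0, sigma1=4.0, sigma2=3.0, n1=64, n2=36)
print(f"z = {z:.3f}, two-tailed p = {p:.4f}")
```

Under the null hypothesis of equal population means, `hypothesized_diff` stays at zero, so the numerator reduces to the difference between the sample means.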
t-Test: Difference Between Means
When the sample size is less than 30 and the population
standard deviations are unknown, the t-test is used to determine
whether or not a significant difference exists between two means
(that is, whether the two population means are equal).

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{(\bar{x}_1 - \bar{x}_2)}}$$

where

$$s_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\left(\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\right)\left(\frac{n_1 + n_2}{n_1 n_2}\right)}$$
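A minimal Python sketch of the pooled t statistic follows. It assumes that each sample variance $s_i^2$ in the formula is computed with divisor $n_i$, so that $n_i s_i^2$ is the sum of squared deviations; that reading, the function name `pooled_t`, and the data are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

def pooled_t(sample1, sample2, hypothesized_diff=0.0):
    """t statistic for two independent small samples with pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    # Assumption: s^2 uses divisor n, so n * s^2 is the sum of squared deviations
    s1_sq, s2_sq = pvariance(sample1), pvariance(sample2)
    pooled = (n1 * s1_sq + n2 * s2_sq) / (n1 + n2 - 2)
    # Standard error of the difference between the two sample means
    se = math.sqrt(pooled * (n1 + n2) / (n1 * n2))
    t = ((mean(sample1) - mean(sample2)) - hypothesized_diff) / se
    return t, n1 + n2 - 2  # statistic and degrees of freedom

# Hypothetical scores for two small groups
group1 = [12, 15, 11, 14, 13]
group2 = [10, 9, 12, 8, 11]
t, df = pooled_t(group1, group2)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The resulting t value would then be compared with the critical value from a t table at the chosen significance level and the returned degrees of freedom.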
Difference Between
Two Proportions and Independent Samples
Let p1 and p2 be the proportions of two samples drawn from
respective populations with proportions P1 and P2 . The null
hypothesis is that there is no difference between the two
population proportions; that is, P1 = P2 or, stated another way, P1 - P2 = 0. If the null hypothesis is true (P1 = P2), the two populations
are really the same population.
The basic concept concerning the difference between two sample
proportions is analogous to that concerning the difference
between two sample means.
1. The mean of the sampling distribution of (p1 - p2) is equal to
the difference between the two population proportions, P1
and P2, or $\mu_{p_1 - p_2} = P_1 - P_2$.
Difference Between
Two Proportions and Independent Samples –
cont’d
2. The variance of the difference between two sample
proportions is the sum of the variances of the two sample
proportions,

$$\sigma^2_{(p_1 - p_2)} = \sigma^2_{p_1} + \sigma^2_{p_2} = \frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}$$

where $Q_1 = 1 - P_1$ and $Q_2 = 1 - P_2$.

When the sampling distributions of p1 and p2 are normal, the
distribution of the differences between p1 and p2 is also
normal. Since the mean of the sampling distribution of p1 - p2
is equal to the difference between the two population
proportions, the following standardized statistic is normally
distributed:

$$z = \frac{(p_1 - p_2) - (P_1 - P_2)}{\sigma_{(p_1 - p_2)}}$$
Difference Between
Two Proportions and Independent Samples –
cont’d
p1 = sample proportion of successes in first group
p2 = sample proportion of successes in second group
P1 = population proportion of first group
P2 = population proportion of second group
$\sigma_{(p_1 - p_2)}$ = standard error of the difference between two sample proportions (the square root of its variance)

When P1 = P2, P1 - P2 = 0 and P1Q1 = P2Q2 = PQ, where Q = 1 - P. Thus

$$z = \frac{p_1 - p_2}{\sigma_{(p_1 - p_2)}}$$

where

$$\sigma_{(p_1 - p_2)} = \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}} = \sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
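The test for two proportions under the null hypothesis P1 = P2 can be sketched as below. One assumption goes beyond the slides: because the common population proportion P is normally unknown, the sketch estimates it with the pooled sample proportion. The function name `two_proportion_z` and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(successes1, n1, successes2, n2):
    """z statistic for the difference between two sample proportions."""
    p1, p2 = successes1 / n1, successes2 / n2
    # Assumption: the unknown common P is estimated by the pooled proportion
    p_pooled = (successes1 + successes2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    # Standard error under the null hypothesis P1 = P2
    se = math.sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 60 of 100 versus 40 of 100 "successes"
z, p_value = two_proportion_z(60, 100, 40, 100)
print(f"z = {z:.3f}, two-tailed p = {p_value:.4f}")
```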
Analysis of Variance
• The two tests (z-tests and t-tests) are useful when testing a null
hypothesis when only two samples are involved. Analysis of
Variance (ANOVA) is often the preferred method to test whether
there is a significant difference among means of two or more
independent samples. It is applicable whenever a study involves
an interval- or ratio-scaled dependent variable.
• One-Way Analysis of Variance is discussed in this chapter. It is a
bivariate statistical technique that involves only one independent
variable, although there may be multiple levels of that variable.
• The null hypothesis for ANOVA is that the means of normally
distributed populations, such as three populations, a, b, c, are
equal, or $\mu_a = \mu_b = \mu_c$.
Analysis of Variance – cont’d
If we take a random sample from each of the three original
populations, we may consider the three samples as subsets of a
single large sample drawn from the single large population.

$$\text{Grand mean} = \bar{\bar{X}} = \frac{\sum X_a + \sum X_b + \sum X_c}{4 + 5 + 4} = \frac{\sum X}{13}$$

The unbiased estimate of the large population variance ($\sigma^2$) based
on the preceding samples may be obtained by calculating the
variance between groups [MSA ($\hat{s}_1^2$)] and the variance within
groups [MSE ($\hat{s}_2^2$)].
Variance Between Groups
The variance between groups (or between samples) is also referred to as
the “mean sum of squares between (among) groups.” It is sometimes
denoted as MSA or $\hat{s}_1^2$. It is written in a general form:

$$\hat{s}_1^2 = \text{MSA} = \frac{SS_{\text{between}}}{df} = \frac{\sum n_i (\bar{X}_i - \bar{\bar{X}})^2}{r - 1}$$

i = individual groups or samples a, b, c, …
$n_i$ = size of group i, or size of sample drawn from population i, such
as $n_a = 4$, $n_b = 5$, $n_c = 4$ in the preceding illustration
$\bar{X}_i$ = mean of the items in group or sample i
$\bar{\bar{X}}$ = grand mean, or mean of all items in the single large sample
$\bar{X}_i - \bar{\bar{X}}$ = deviation of group mean from grand mean
$(\bar{X}_i - \bar{\bar{X}})^2$ = variation, or squared deviation (The term “variation” has been
used loosely in previous discussions. Here, the term is limited
to represent the squared deviation.)
r = number of groups or samples, such as three groups in the above
illustration
Variance Between Groups – cont’d
Note that the deviation $\bar{X}_i - \bar{\bar{X}}$ is called the effect, and the
nature of the sample i is called the treatment. Furthermore,
whenever ANOVA is used, the independent variables are
called factors, so the different levels (or categories) of a
factor are the treatments.
Variance Within Groups
The variance within groups (or within individual samples) is
also referred to as the mean square error (MSE) or $\hat{s}_2^2$, since
it is an estimate of the random error existing in the data. It is
written in a general form

$$\hat{s}_2^2 = \text{MSE} = \frac{SS_{\text{within}}}{df} = \frac{\sum\sum (X_i - \bar{X}_i)^2}{n - r}$$

where $X_i$ = individual items in group i
$n = n_a + n_b + n_c$ = number of items in the single large sample
F-Test
Represents the variance ratio in showing the relationship between
the two independently estimated population variances

$$F = \frac{\hat{s}_1^2}{\hat{s}_2^2}$$

where the subscripts 1 (in the numerator) and 2 (in the
denominator) indicate the sample numbers and each represents the
estimate of the population variance based on the sample.
The F-statistic is the variance between groups divided by the
variance within groups. It is used to test for group differences and
compares one sample variance with another sample variance. It can
be presented this way:

$$F = \frac{\text{Variance between groups}}{\text{Variance within groups}} = \frac{\text{MSA}}{\text{MSE}}$$
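The manual one-way ANOVA calculation (grand mean, MSA, MSE, and F) can be sketched in Python as follows. The three groups use the sizes 4, 5, and 4 mentioned in the chapter, but the values themselves are invented for illustration, and `one_way_anova` is a hypothetical helper name.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (MSA, MSE, F) for a list of groups of observations."""
    all_items = [x for g in groups for x in g]
    n, r = len(all_items), len(groups)
    grand_mean = mean(all_items)
    # Between-groups sum of squares: n_i * (group mean - grand mean)^2
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: (item - its own group mean)^2
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    msa = ss_between / (r - 1)   # variance between groups
    mse = ss_within / (n - r)    # variance within groups
    return msa, mse, msa / mse

# Hypothetical data for groups a, b, c (sizes 4, 5, 4)
groups = [[4, 5, 6, 5], [7, 8, 9, 8, 8], [5, 6, 7, 6]]
msa, mse, f = one_way_anova(groups)
print(f"MSA = {msa:.3f}, MSE = {mse:.3f}, F = {f:.3f}")
```

The computed F would be compared with the critical value from an F table using r - 1 and n - r degrees of freedom; the sketch stops short of that lookup because the Python standard library has no F distribution.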
Tests of Associations
• Examine associations between two or more variables.
• When two groups are studied, there will always be a
variable that predicts the actions of another variable.
The predictor variable is the independent variable, and
the criterion variable is the dependent variable.
• Tests to measure statistical relationships between
variables are:
– Regression Analysis
– Correlation Analysis
Scatter Diagrams
When two related variables, called bivariate data, are
plotted as points on a graph, the graph is called a scatter
diagram. A scatter diagram indicates whether the
relationship between the two variables is positive or
negative.
Regression Analysis
• Refers to statistical techniques for measuring the linear or curvilinear
relationship between a dependent variable and one or more
independent variables. The relationship between two variables is
characterized by how they vary together.
• Given pairs of X and Y variables, regression analysis measures the
direction (positive or negative) and rate of change (slope) in Y as X
changes, or vice versa. Using the values of the independent variable, it
attempts to predict the values of an interval- or ratio-scaled dependent
variable.
• Regression analysis requires two operations: (1) Derive an equation,
called the regression equation, and a line representing the equation to
describe the shape of the relationship between the variables. (2)
Estimate the dependent variable (Y) from the independent variable (X),
based on the relationship described by the regression equation.
• The regression line is the line drawn through a scatter diagram that
“best fits” the data points and most accurately describes the
relationship between the two variables.
Regression Equation and Regression Line
While all shapes are informative, a straight line is especially useful,
because it is the easiest to deal with in regression analysis to
describe the shape of the average relationship between two
variables. The straight line can be expressed by the linear equation:

$$Y_c = a + bX$$

where
$Y_c$ = computed value of the dependent variable
a = Y-intercept where X equals zero
b = slope of the regression line, which is the increase or decrease
in Y for each change of one unit of X
X = a given value of the independent variable
Regression Equation and Regression Line –
cont’d
To create a regression model, researchers estimate the regression line
using the following equation

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where
$\beta_0$ = Y-intercept where X equals zero
$\beta_1$ = slope of the regression line, which is the increase or
decrease in Y for each change of one unit of X
$X_i$ = a given value of the independent variable
i = observation number
$\varepsilon_i$ = error term associated with the ith observation
Least-Squares Method
• A statistical technique that fits a straight line to a scatter diagram
by finding the smallest sum of the vertical distances squared
(i.e., $\sum e_i^2$) of all the points from the straight line. The equation
derived by this method will yield a regression line that best fits
the data.
• To calculate the straight line by the least-squares method, the
equation $Y_c = a + bX$ is used. We must first determine the
constants, a and b, which are called regression coefficients.
Regression coefficients are the values that represent the effect of
the individual independent variables on the dependent variable.
Least-Squares Method – cont’d
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$

$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} \quad \text{or} \quad a = \bar{Y} - b\bar{X}$$
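The two least-squares formulas translate almost directly into code. The sketch below uses a hypothetical function name, `least_squares`, and five invented (X, Y) pairs; it is meant only to show the arithmetic, not to reproduce any example from the textbook.

```python
def least_squares(xs, ys):
    """Return the regression coefficients (a, b) for Yc = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of Y minus slope times mean of X
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(f"Yc = {a:.2f} + {b:.2f}X")
```

For these invented figures the sums are n = 5, ΣX = 15, ΣY = 20, ΣXY = 66, and ΣX² = 55, which give b = 30/50 = 0.6 and a = 4 - 0.6(3) = 2.2.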
Standard Deviation of Regression
The standard deviation of the Y values from the regression line ($Y_c$)
is called the standard deviation of regression. It is also popularly
called the standard error of estimate, since it can be used to
measure the error of the estimates of individual Y values based on
the regression line. Thus
$s_y$ = the standard deviation of Y values from the mean $\bar{Y}$
$s_x$ = the standard deviation of X values from the mean $\bar{X}$
$s_{yx}$ = the standard deviation of regression of Y values from $Y_c$
$s_{xy}$ = the standard deviation of regression of X values from $X_c$
Standard Deviation of Regression – cont’d
The standard deviation of Y values from the regression line is based on the points
representing Y values scattered around the least-squares line. The closer the
points to the line, the smaller the value of the standard deviation of regression.
Thus, the estimates of Y values based on the line are more reliable. On the other
hand, the wider the points are scattered around the least-squares line, the larger
the standard deviation of regression and the smaller the reliability of the
estimates based on the line or the regression equation. The general formula for
the standard deviation of regression of Y values on X is

$$s_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{n - k}}$$

where k = number of total (dependent and independent) variables. However, a
simpler method of computing $s_{yx}$ is to use the following formula

$$s_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - k}}$$
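As a rough check that the general and the simpler formulas for the standard deviation of regression agree, the sketch below computes both on the same invented data. The helper names `fit_line` and `std_dev_of_regression` are hypothetical, and k = 2 is assumed for one dependent and one independent variable.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Yc = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

def std_dev_of_regression(xs, ys, k=2):
    """Return s_yx two ways: from the residuals and from the shortcut."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    # General form: squared vertical distances of Y from Yc
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    direct = math.sqrt(ss_resid / (n - k))
    # Simpler form: sum(Y^2) - a*sum(Y) - b*sum(XY)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    shortcut = math.sqrt((sum_y2 - a * sum(ys) - b * sum_xy) / (n - k))
    return direct, shortcut

# Hypothetical paired observations
direct, shortcut = std_dev_of_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"general form: {direct:.4f}, simpler form: {shortcut:.4f}")
```

The shortcut only matches the general form when a and b are the least-squares coefficients, which is why the sketch refits the line before applying it.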
Correlation Analysis
Correlation Analysis: Refers to the statistical
techniques for measuring the closeness of the
relationship between two metric (interval- or ratio-scaled) variables. It measures the degree to which
changes in one variable are associated with changes in
another. The computation concerning the degree of
closeness is based on regression statistics.
Total Deviation, Coefficient of Determination,
and Correlation Coefficient
Total Deviation ($Y - \bar{Y}$). Assume there are two variables, X and Y. The
mean of Y values, $\bar{Y} = (\sum Y)/n$, is obtained without referring to X values.
The $Y_c$, representing the regression line of Y values = a + bX, is obtained
with the influence of X values. If Y values are related to X values to some
degree, the deviations of Y values from $\bar{Y}$ must be reduced somewhat by
the introduction of X values in computing $Y_c$ values. The total deviation of
Y from the mean $\bar{Y}$ is divided into two parts:

$$Y - \bar{Y} = (Y - Y_c) + (Y_c - \bar{Y})$$

Total deviation = Unexplained deviation + Explained deviation
Total Deviation, Coefficient of Determination,
and Correlation Coefficient – cont’d
The explained variation $\sum (Y_c - \bar{Y})^2$ may also be referred to as the
regression sum of squares (RSS). The unexplained variation $\sum (Y - Y_c)^2$
is called the error sum of squares (ESS). This relationship may be
expressed as

Total variation = Unexplained variation + Explained variation

$$\text{TSS} = \text{ESS} + \text{RSS}$$

$$\sum (Y - \bar{Y})^2 = \sum (Y - Y_c)^2 + \sum (Y_c - \bar{Y})^2$$
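The split of total variation into unexplained and explained parts can be verified numerically. The sketch below follows the chapter's naming, in which ESS is the error (unexplained) sum of squares and RSS is the regression (explained) sum of squares; the function name `decompose_variation` and the data are hypothetical.

```python
from statistics import mean

def decompose_variation(xs, ys):
    """Return (TSS, ESS, RSS) for a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    y_bar = mean(ys)
    y_c = [a + b * x for x in xs]                       # computed Y values
    tss = sum((y - y_bar) ** 2 for y in ys)             # total variation
    ess = sum((y - yc) ** 2 for y, yc in zip(ys, y_c))  # unexplained
    rss = sum((yc - y_bar) ** 2 for yc in y_c)          # explained
    return tss, ess, rss

# Hypothetical paired observations
tss, ess, rss = decompose_variation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"TSS = {tss:.2f}, ESS = {ess:.2f}, RSS = {rss:.2f}")
```

Some other texts swap these abbreviations (using ESS for "explained" and RSS for "residual"), so it is worth checking which convention a given software package follows.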
Coefficient of Determination (r2)
The coefficient of determination ($r^2$) is the strength of association or
degree of closeness of the relationship between two variables measured by
a relative value. It demonstrates how well the regression line fits the
scattered points. It may be defined as the ratio of the explained variation to
the total variation:

$$\text{Coefficient of determination} = \frac{\text{Explained variation}}{\text{Total variation}}$$

or symbolically,

$$r^2 = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = \frac{\text{RSS}}{\text{TSS}}$$

The range of the $r^2$ value is therefore from 0 to 1. When $r^2$ is close to 1, the Y
values are very close to the regression line. When $r^2$ is close to 0, the Y values
are not close to the regression line.
Correlation Coefficient
The correlation coefficient, the square root of $r^2$ or $\sqrt{r^2} = \pm r$,
is frequently computed to indicate the direction of the
relationship in addition to indicating the degree of the
relationship.
It is the correlation between the observed and predicted values
of the dependent variable. Since the range of $r^2$ is from 0 to 1,
the coefficient of correlation r will vary within the range of
0 to ±1, that is, from -1 to +1.
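A short Python sketch can show how $r^2$ and r relate. It assumes the common convention that r takes the sign of the regression slope b, which the slides imply but do not state outright; the function name `correlation` and the data are hypothetical.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Return (r_squared, r), with r taking the sign of the slope b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    y_bar = mean(ys)
    rss = sum((a + b * x - y_bar) ** 2 for x in xs)  # explained variation
    tss = sum((y - y_bar) ** 2 for y in ys)          # total variation
    r_squared = rss / tss
    # Assumption: the sign of r follows the sign of the slope
    r = math.copysign(math.sqrt(r_squared), b)
    return r_squared, r

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 5, 4, 5]))  # upward-sloping data, r > 0
print(correlation(xs, [5, 4, 5, 4, 2]))  # downward-sloping data, r < 0
```

Both invented data sets have the same $r^2$, which illustrates why r is reported alongside it: only r carries the direction of the relationship.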
Decision Time!
As a marketing manager, you want information from
marketing researchers that can enhance your decision-making abilities.
If correlation analysis is a popular and informative
statistical method, why should researchers bother using
more complex, somewhat intimidating bivariate
statistical techniques?
Do you feel that there is really that much to gain from
these methods? Why or why not?
Net Impact
• The Internet can be a valuable tool to learn about
bivariate statistical techniques.
• Using almost any search engine, you can find a variety
of discussions about the topic.
• These discussions may be available on the Internet as
part of a company’s promotion of its statistical
services, a university professor’s statistical seminar
notes, or PowerPoint slides that were used in a
seminar presentation.
Chapter 16
End of Presentation