Transcript
```Research in business studies
SPRING 2009-10
Quantitative and Qualitative Data Analysis
by
Assoc. Prof. Sami Fethi
© 2009/10, Sami Fethi, EMU, All Right Reserved, Pearson Education, 2005, 3. Ed.
Quantitative data analysis
o Examining differences
o Relationship between variables
o Explaining and predicting relationship between variables
o Data reduction, structure and dimension
o Characteristics of qualitative research
o Qualitative data
o Analytical procedure
o Interpretation
o Strategies for qualitative analysis
o Quantify qualitative data
o Validity in qualitative research
Examining differences
o In research we often have to make statements about the mean. When the
population variance is unknown, the standard error of the mean is also
unknown and must be estimated from sample data.
o e.g. $SD_{\bar{X}} = SD' / \sqrt{N}$
where
$SD_{\bar{X}}$ = standard error of the mean
$SD'$ = estimated standard deviation
$N$ = sample size
$SD' = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}}$
N - 1 is the degrees of freedom.
o Example 1: For a supermarket chain to add a new product, at least 100
units must be sold per week. The new product is tested in ten randomly
selected stores for a limited time.
Apply a one-tailed t-test to answer the question: will the new product
sell more than 100 units per week?
a) construct the hypotheses
b) calculate the mean and standard deviation if they are not given
c) calculate the standard error of the mean
d) find the t-value
a) H0: µ <= 100
H1: µ > 100
b) The sample mean X and standard deviation SD are given as 109.4 and 14.90, respectively.
c) SDX = 14.90/ 10  1 =4.55
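d) t = (X - µ)/SDX = (109.4 - 100)/4.55 = 2.07
The critical value from the t-table is 1.83 at the 5% significance level
(one-tailed, 9 degrees of freedom).
Since 2.07 > 1.83, we reject the null hypothesis.
o Comparing two means is usually associated with a question such as: are
the tastes in region A different from the tastes in region B?
o e.g. $Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{SD_{\bar{X}_1 - \bar{X}_2}}$
where
$\bar{X}_1$ = sample mean for the first sample
$\bar{X}_2$ = sample mean for the second sample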
$SD_{\bar{X}_1 - \bar{X}_2}$ = the standard error of the difference in means
µ1 and µ2 are the unknown population means, and the general estimate is:
$SD_{\bar{X}_1 - \bar{X}_2} = \sqrt{SD_{\bar{X}_1}^2 + SD_{\bar{X}_2}^2} = \sqrt{\dfrac{SD_1^2}{N_1} + \dfrac{SD_2^2}{N_2}}$
If the two population variances are assumed to be equal, the common
population variance can be estimated by pooling the samples. When the
variances are unknown and the standard errors of the means must be
estimated, t is an adequate test statistic, distributed with
v = N1 + N2 - 2 degrees of freedom.
Example 2: A manufacturer has developed a new product and wonders whether
the label of the package should be red or blue. The new product with the
two different labels is tested in ten randomly selected stores. The mean
sales obtained are 403.0 for the red package and 390.3 for the blue
package. The standard error of the difference in means is 8.15.
a) construct the hypotheses
b) find the t-value
a) H0: (µ1 - µ2) = 0
H1: (µ1 - µ2) ≠ 0
or
H0: (µ1 - µ2) <= 0
H1: (µ1 - µ2) > 0
b) $t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{SD_{\bar{X}_1 - \bar{X}_2}} = \dfrac{(403.0 - 390.3) - 0}{8.15} = 1.56$
v = 10 + 10 - 2 = 18 degrees of freedom. At the 5% level (two-tailed)
with 18 degrees of freedom, the critical value from the table is 2.101.
Since 1.56 < 2.101, the null hypothesis H0: (µ1 - µ2) = 0 cannot be
rejected. The two unknown population means are therefore assumed to be
the same.
Useful alternative tests
o In problems involving one or two population means, t-methods are
usually appropriate, but non-parametric methods are often good
alternatives.
o Non-parametric methods have the advantage of requiring fewer
assumptions, but they are less powerful than t-methods (see Siegel
and Castella, 1998).
o The main difference between them is that t-methods are concerned with
means, while non-parametric methods are concerned with medians.
o ANOVA (analysis of variance) compares more than two groups
simultaneously. The method rests on the ratio of systematic variance to
unsystematic variance.
o In ANOVA, the following are computed:
- The total variation, by comparing each observation with the grand mean.
- The between-group variation, by comparing the treatment means with
the grand mean.
- The within-group variation, by comparing each score in a group with
the group mean.
o Recall MANOVA (multivariate analysis of variance): unlike ANOVA, it has
more than one dependent variable.
Comparison of more than two groups
Example 3: Advertising campaigns were tested in 24 randomly selected
cities comparable in size and demographics. The following table shows the
output of the ANOVA analysis:

Source          Sum of sq.   Degrees of freedom   Mean sq.   F-ratio
Between group      49.0              2              24.5       5.88
Within group       87.5             21               4.17
Total             136.5             23
Example 3
a) construct the hypotheses
b) find the F-value and whether it is significant or not
c) comment on the F-values
a) H0: G1 = G2 = G3
H1: the group means are not all equal
Total d.f. = 24 - 1 = 23; between groups 3 - 1 = 2; within groups 23 - 2 = 21.
b) F calculated = 24.5/4.17 = 5.88
F critical has (k - 1, n - k) = (3 - 1, 24 - 3) = (2, 21) degrees of
freedom. From the F-distribution at the 5% level, F critical is 3.47.
c) Since 5.88 is greater than 3.47, we reject the null hypothesis that
the group means are equal and accept the alternative hypothesis that the
advertising campaigns vary in effectiveness.
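As a check on the ANOVA table, each mean square is the sum of squares
divided by its degrees of freedom, and the F-ratio is the between-group
mean square divided by the within-group mean square. Using only the sums
of squares and degrees of freedom reported in the table:
MS between = 49.0 / 2 = 24.5
MS within = 87.5 / 21 = 4.17
F = 24.5 / 4.17 = 5.88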
Relationship between variables
o In research, we are often preoccupied with whether there is a
relationship between two or more variables, that is, whether they covary.
o Correlation coefficient
Based on the Pearson criterion, it measures the strength of the linear
relationship between two variables, for example x and y.
o Theoretically, the correlation coefficient can take values from -1 to 1.
A correlation coefficient of 1 tells us that the two variables covary
perfectly and positively, whereas -1 shows that they are perfectly
inversely related. A value close to 0 indicates that the variables are
linearly unrelated.
The formula of the correlation coefficient is as follows, where
$\bar{X}$ and $\bar{Y}$ represent the sample means of X and Y:
$r_{XY} = \dfrac{\sum_i (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_i (x_i - \bar{X})^2 \, \sum_i (y_i - \bar{Y})^2}}$
A correlation coefficient shows covariation between two variables; it
does not show that the variables are causally related.
The square of the correlation coefficient is the coefficient of
determination.
R2 = Explained variation / Total variation
o Example 4: partial correlation
Using the following table (Table 1), calculate how the relationship is
influenced by controlling for sex.
Example 4
o This is a partial correlation. With 1 = ad.roc, 2 = appeal and 3 = sex,
the partial correlation coefficient $r_{12.3}$ is:
$r_{12.3} = \dfrac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \dfrac{0.24 - (-0.33)(0.09)}{\sqrt{1 - (-0.33)^2}\,\sqrt{1 - (0.09)^2}} = 0.29$
o This shows that, controlling for sex, the observed relationship between
ad.roc and appeal remains positive and is strengthened.
Explaining and predicting relationship between variables
o A useful approach to explaining and predicting relationships between
variables is regression analysis. In regression analysis, we want to fit
the model that best describes the data, which is done by applying the
method of least squares. More precisely, we fit the straight line that
minimizes the sum of squared vertical deviations from that line, as shown
in the following figure.
o Single Linear Regression
$Y_i = a_0 + a_1 x_i + e_i$
where Y = the outcome variable, x = the predictor variable, a1 = the slope
of the straight line fitted to the data, a0 = the intercept of the line,
and ei = the difference between the score predicted and the score
actually obtained. This difference is called the residual.
Figure 1 The linear model
Example 5
Table 2 Data matrix
Simple Mean Regression-output
Example 5
o Assume that a car dealer collects data for six months on advertising
and sales. Y is sales. The car dealer expects car sales to be correlated
with competitors' ads. Based on the information below, comment on the
estimated coefficient and t-ratio as well as R2.
Table 3 Simple mean regression-output
o The estimated constant term of 0.7 shows that if the dealer does not
advertise on TV, estimated car sales are 0.7 units, that is, seven cars.
The estimated regression coefficient of sales on Tv-ads is 0.9. This
coefficient shows that if the variable Tv-ads is increased by 1 unit, the
estimated expected value of car sales increases by 0.9 units, that is,
nine cars. The R-square (R2) of 85.3 percent means that the sample
coefficient of determination is 0.853. Practically speaking, the
variation in Tv-ads explains 85.3 percent of the variation in the
dependent variable car sales. The estimated t-value on Tv-ads is 4.81,
which is greater than 2 (the rule-of-thumb tabular value from the
t-distribution), so it is significant at the 5% and 1% levels. This means
that we can reject the null hypothesis that the corresponding population
regression coefficient is equal to zero. The conclusion is that Tv-ads
and sales are significantly related to each other, or that Tv-ads has a
positive impact on sales.
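As a sketch of how the fitted line can be used for prediction, take the
estimates quoted above (constant 0.7, slope 0.9) and a hypothetical
Tv-ads value of 2 units. The value 2 is an illustrative assumption and
is not a figure from the source data:
Predicted carsales = 0.7 + 0.9 × 2 = 2.5 units, that is, 25 cars.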
Assumptions in Regression analysis
o The expected value of the error term is zero
o The variance of the error term is constant for each X. This is termed
homoscedasticity. If the variance of e varies with X, this is termed
heteroscedasticity.
o The errors for the observations are uncorrelated.
o e should be normally distributed for each X.
o The error term should not be correlated with X: corr(e, X) = 0
o It is also a common assumption that the regression
model should be linear in its parameters.
Correlation Coefficients-output
Example 6
o Assume that a car dealer collects data for six months on competitors'
advertising and sales. Y is sales. The car dealer expects car sales to be
positively correlated with competitors' ads. Use the concept of the
correlation coefficient to explain the relationships between the
variables under inspection, based on the information given in Table 4.
Table 4 Correlation coefficients-output
o The correlation between carsales (dependent) and Tv (explanatory) is
expected to be high. The correlation among the explanatory variables is
expected to be low. A high correlation coefficient between, for example,
Tv advertising and printing advertising shows a high degree of
multicollinearity, which distorts the estimation results. To remedy this
situation, the relevant variable can be dropped from the regression
equation. For example, the correlation between sales and Tv-ads is 0.92,
which is a high score, whereas that between sales and Comp-ads is 0.155,
which is a very low score.
Multiple Regression
o In multiple regression, two or more independent (explanatory) variables
are used to explain or predict the dependent variable. The purpose is to
make the model more realistic, control for other variables, explain more
of the variance in the dependent variable, and reduce the residuals. The
following is a typical example of output from a multiple regression.
Table 5 Multiple regression - output
Dummy Variables
o In a multiple regression, a dummy variable can be used in two ways. It
can be a dependent variable taking the value 1 or 0, which is also called
a dichotomous variable, or it can be an independent variable taking the
value 0 or 1. A dummy variable is used in an analysis when a variable has
no numerical values. For example, season in the following table is a
nominal-scaled variable that cannot be ranked, so before it can be used
in a regression analysis the seasons need to be assigned numbers.
Table 6 Coding of dummy variable
Example 7
o In the following table, there are three new variables A, B and C, which
code the four seasons as different combinations of zeros and ones. Assume
that the following regression model for sales of women's clothing, in
which the price (P) is also included, has been estimated:
Sale = 1000 - 0.5P + 100A - 20B - 50C
a) Calculate the sales in the summer, taking the dummy variables into
account (P = $200).
b) Calculate the sales in the autumn, taking the dummy variables into
account (P = $200).
c) Compare the sales in winter and spring at the same price.
Table 6 Coding of dummy variable
a) Calculate the sales in the summer, taking the dummy variables into
account (P = $200).
Sale = 1000 - 0.5(200) + 100(1) - 20(0) - 50(0) = $1000
b) Calculate the sales in the autumn, taking the dummy variables into
account (P = $200).
Sale = 1000 - 0.5(200) + 100(0) - 20(1) - 50(0) = $880
c) Compare the sales in winter and spring at the same price.
Winter: Sale = 1000 - 0.5(200) + 100(0) - 20(0) - 50(1) = $850
Spring: Sale = 1000 - 0.5(200) + 100(0) - 20(0) - 50(0) = $900
At the same price, winter sales are $50 lower than spring sales.
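A brief note on reading the dummy coefficients, assuming the coding used
in the solution above (summer A = 1, autumn B = 1, winter C = 1, spring
all zeros): spring is the reference season, so at any common price P the
price term cancels and each coefficient is the gap to spring.
Sale(summer) - Sale(spring) = 100(1) = +100
Sale(autumn) - Sale(spring) = -20(1) = -20
Sale(winter) - Sale(spring) = -50(1) = -50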