RESEARCH METHODOLOGY
RESULT AND ANALYSIS
(part 2)
HYPOTHESIS TESTING
A hypothesis is:
 A conjecture about a population parameter. This
conjecture may or may not be true.
 An educated guess based on theory and background
information.
 A proposed explanation for a phenomenon.
Hypothesis testing is the process of using sample data and
statistical procedures to decide whether to reject or not
reject a hypothesis (statement) about a population
parameter value.
Examples
 Whether seat belts reduce the severity of injuries
caused by accidents
 Whether the public prefers a certain colour of fabric
lining
 Whether adding a chemical improves water
quality
 The average life expectancy of men in the next decade
will be more than 100 years
 Education increases income
Education increases income
 This hypothesis states a positive relationship between the
concepts "education" and "income."
 An abstract or conceptual hypothesis like this cannot be tested
directly. First, it must be operationalized, that is, situated in the
real world by rules of interpretation.
 To test the hypothesis, the abstract concepts of education and
income must be made measurable. Education could be measured by
"years of school completed" or "highest degree
completed". Income could be measured by
"hourly rate of pay" or "yearly salary".
Two types of statistical hypothesis
 The Null Hypothesis, symbolised by Ho, states that
there is no difference between a parameter and a
specific value OR that there is no difference between
two parameters. NULL means NO CHANGE. It is a
statement of equality.
 The Alternative Hypothesis, symbolised by Ha (also written H1),
states that there is a specific difference between a parameter and a
specific value OR that there is a difference
between two parameters. It is also called the test or research
hypothesis.

Situation A: A researcher is interested in finding out
whether a new medicine will have any undesirable
side effects on the pulse rate of the patient: will the
pulse rate increase, decrease or remain unchanged?
Since the researcher knows that the mean pulse rate of the
population under study is 82 beats per minute, the
hypotheses will be
Ho : µ = 82 (remains unchanged)
H1 : µ ≠ 82 (will be different)
This is a two-tailed test since the possible effect
could be to raise or lower the pulse rate.
Situation B: A chemist invents an additive to
increase the life of an automobile battery. The
mean lifetime of an ordinary battery is 36 months. The
hypotheses will be:
Ho : µ ≤ 36
Ha : µ > 36
The chemist is interested only in increasing the
lifespan of the battery. His alternative hypothesis is
that the mean is larger than 36.
Therefore the test is called right-tailed: the interest is
in the increase only.
Situation C: A contractor wishes to lower heating
bills by using a special type of insulation in houses.
If the average monthly bill is RM100, his
hypotheses will be
Ho : µ ≥ RM 100
H1 : µ < RM 100
This is a left-tailed test since the contractor is
only interested in reducing the bill.
General procedure for testing a hypothesis statistically:
 Step 1: State the hypotheses.
 Step 2: Formulate an analysis plan: select a level of
significance (e.g. 0.1, 0.05 or 0.01), decide whether the test is
one-tailed or two-tailed, and find the corresponding critical value.
 Step 3: Analyze sample data.
 Step 4: Interpret results, that is, make the decision to reject or
not to reject the hypothesis. If the test value falls beyond the
critical value (in the rejection region), reject Ho. Otherwise, do not
reject Ho.
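As an illustration of Step 2, the critical z values for the significance levels listed above can be looked up with Python's standard library. This is a minimal sketch that assumes a z-test; the helper name `z_critical` is introduced here for illustration only.

```python
from statistics import NormalDist

def z_critical(alpha: float, two_tailed: bool) -> float:
    """Return the positive critical z value for significance level alpha."""
    # A two-tailed test splits alpha equally between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  one-tailed: {z_critical(alpha, False):.3f}  "
          f"two-tailed: {z_critical(alpha, True):.3f}")
```

For a left-tailed test the critical value is the negative of the one-tailed value returned here.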
Significant difference
A significant difference occurs if the difference between
the hypothesized (null) value and the sample statistic
value is too large to be attributed to chance. A
significant difference strongly suggests that the null
hypothesis is not true.
A significant difference at p < 0.05 means that, if the null
hypothesis were true, a difference at least this large would
occur by chance less than 5% of the time.
TESTING THE DIFFERENCE AMONG
MEANS AND VARIANCE
Situations:
 To compare the average lifetime of two
different brands of tires
 Two different brands of fertilizer, whether one is
better than the other for growing plants
 Two brands of cough syrup, to test whether one
brand is more effective than the other
Problem 1: Two-Tailed Test
Suppose the Acme Drug Company develops a new drug,
designed to prevent colds. The company states that the drug
is equally effective for men and women. To test this claim,
they choose a simple random sample of 100 women and
200 men from a population of 100,000 volunteers.
At the end of the study, 38% of the women caught a cold; and
51% of the men caught a cold. Based on these findings, can
we reject the company's claim that the drug is equally
effective for men and women? Use a 0.05 level of
significance.
Solution:
 State the hypotheses. The first step is to state the null
hypothesis and an alternative hypothesis.
Null hypothesis: P1 = P2
Alternative hypothesis: P1 ≠ P2
Note that these hypotheses constitute a two-tailed test. The null
hypothesis will be rejected if the proportion from population 1 is
too big or if it is too small.
 Formulate an analysis plan. For this analysis, the significance
level is 0.05. The test method is a two-proportion z-test.
 Analyze sample data. Using sample data, we calculate the
pooled sample proportion (p) and the standard error (SE). Using
those measures, we compute the z-score test statistic (z).
p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) + (0.51 *
200)] / (100 + 200) = 140/300 = 0.467
SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
SE = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt
[0.003733] = 0.061
z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13
where p1 is the sample proportion in sample 1 (women), p2 is the
sample proportion in sample 2 (men), n1 is the size of sample 1, and n2
is the size of sample 2.
 Since we have a two-tailed test, the P-value is the probability that
the z-score is less than -2.13 or greater than 2.13.
 We use the Normal Distribution Calculator to find P(z < -2.13)
= 0.017, and P(z > 2.13) = 0.017. Thus, the P-value = 0.017 +
0.017 = 0.034.
 Interpret results. Since the P-value (0.034) is less than the
significance level (0.05), we reject the null
hypothesis that the drug is equally effective for men and women.
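The calculation in Problem 1 can be checked with a short Python sketch. It assumes the pooled two-proportion z-test described above; the function name `two_proportion_z` is introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic for the null hypothesis P1 = P2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Sample 1: 100 women, 38% caught a cold. Sample 2: 200 men, 51% caught a cold.
z = two_proportion_z(0.38, 100, 0.51, 200)

# Two-tailed P-value: probability of a z-score at least this extreme in either tail.
p_value = 2 * NormalDist().cdf(-abs(z))

alpha = 0.05
print(f"z = {z:.2f}, P-value = {p_value:.3f}")
print("reject the null hypothesis" if p_value < alpha else "do not reject the null hypothesis")
```

Because the sketch does not round intermediate values, its P-value (about 0.033) differs slightly from the hand calculation (0.034). The decision at the 0.05 level is the same.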
Problem 2: One-Tailed Test
Suppose the previous example is stated a little bit differently.
Suppose the Acme Drug Company develops a new drug,
designed to prevent colds. The company states that the drug is
more effective for women than for men. To test this claim,
they choose a simple random sample of 100 women and
200 men from a population of 100,000 volunteers.
At the end of the study, 38% of the women caught a cold; and
51% of the men caught a cold. Based on these findings, can
we conclude that the drug is more effective for women than
for men? Use a 0.01 level of significance.
Solution:
 State the hypotheses. The first step is to state the null
hypothesis and an alternative hypothesis.
Null hypothesis: P1 >= P2
Alternative hypothesis: P1 < P2
Note that these hypotheses constitute a one-tailed test. The
null hypothesis will be rejected if the proportion of women
catching cold (p1) is sufficiently smaller than the proportion
of men catching cold (p2).
 Formulate an analysis plan. For this analysis, the
significance level is 0.01. The test method is a two-proportion
z-test.
 Analyze sample data. Using sample data, we calculate the pooled
sample proportion (p) and the standard error (SE). Using those
measures, we compute the z-score test statistic (z).
p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) + (0.51 *
200)] / (100 + 200) = 140/300 = 0.467
SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
SE = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt [0.003733]
= 0.061
z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13
where p1 is the sample proportion in sample 1 (women), p2 is the sample
proportion in sample 2 (men), n1 is the size of sample 1, and n2 is the size
of sample 2.
 Since we have a one-tailed test, the P-value is the probability that
the z-score is less than -2.13. We use the Normal Distribution
Calculator to find P(z < -2.13) = 0.017. Thus, the P-value =
0.017.
 Interpret results. Since the P-value (0.017) is greater than the
significance level (0.01), we cannot reject the null
hypothesis.
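The effect of the significance level on the decision in Problem 2 can be illustrated as follows. This is a sketch only: it takes the rounded test statistic z = -2.13 from the solution above as given and uses the standard normal distribution from Python's standard library.

```python
from statistics import NormalDist

z = -2.13  # test statistic from Problem 2: (0.38 - 0.51) / 0.061

# Left-tailed test (alternative P1 < P2): only the lower tail counts.
p_value = NormalDist().cdf(z)

for alpha in (0.05, 0.01):
    decision = "reject" if p_value < alpha else "do not reject"
    print(f"alpha = {alpha}: P-value = {p_value:.3f}, {decision} the null hypothesis")
```

The same data lead to rejection at the 0.05 level but not at the 0.01 level, which is why the significance level must be chosen before the data are analyzed.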
Commonly used Methods
1. z-test
 For detecting a difference between two means for large
samples (two samples)
 Assumptions required
 The samples must be independent, that is, there is no
relationship between the subjects in each sample
 The populations from which the samples are drawn must be
normally distributed, or the samples must be large
Example problem
Suppose that in a particular geographic region, the mean and
standard deviation of scores on a reading test are 100 points and
12 points, respectively. Our interest is in the scores of 55 students
in a particular school who received a mean score of 96. We can ask
whether this mean score is significantly lower than the regional
mean: that is, are the students in this school comparable to a
simple random sample of 55 students from the region as a whole,
or are their scores surprisingly low? Calculate the z-score.
solution
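 We begin by calculating the standard error (SE) of the mean:
SE = sd / sqrt(n) = 12 / sqrt(55) = 12 / 7.42 = 1.62
Next we calculate the z-score, which is the distance from the
sample mean (M) to the population mean (µ) in units of the standard
error:
z = (M - µ) / SE = (96 - 100) / 1.62 = -2.47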
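The two calculations for the reading-test example (standard error first, then the z-score) can be sketched in Python. The function name `z_for_sample_mean` is introduced here for illustration.

```python
from math import sqrt

def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """Return (z, SE): the z-score of a sample mean and the standard error used."""
    se = pop_sd / sqrt(n)  # standard error of the mean
    return (sample_mean - pop_mean) / se, se

z, se = z_for_sample_mean(sample_mean=96, pop_mean=100, pop_sd=12, n=55)
print(f"SE = {se:.2f}, z = {z:.2f}")
```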
problem
1. The mean and standard deviation of scores on a calculating test are 120
points and 18 points, respectively. Our interest is in the scores of 81
students in a particular school who received a mean score of 92. We can ask
whether this mean score is significantly lower than the regional mean:
that is, are the students in this school comparable to a simple random
sample of 81 students from the region as a whole, or are their scores
surprisingly low? Calculate the z-score.
2. Every year, 50,000 runners compete in the Peachtree Road Race. They run
10 kilometers (a little over 6 miles). The average finishing time is 55
minutes, with a standard deviation of 10 minutes. Fred and Wilma
completed the race in 61 and 51 minutes, respectively. Barney and Betty had
finishing times with z-scores of -0.3 and 0.7, respectively.
List the runners in order, starting with the fastest runner and ending
with the slowest runner.
(A) Wilma, Barney, Fred, Betty
(B) Barney, Wilma, Fred, Betty
(C) Wilma, Barney, Betty, Fred
(D) Betty, Fred, Barney, Wilma
(E) None of the above
solution
1. Calculate the standard error (SE) of the mean:
SE = sd / sqrt(n) = 18 / sqrt(81) = 18 / 9 = 2
Next we calculate the z-score:
z = (M - µ) / SE = (92 - 120) / 2 = -28 / 2 = -14
solution
2. The answer is A. This problem can be solved by converting Fred's and
Wilma's raw scores into z-scores. To do this, we use the z-score
equation:
z = (x - µ) / sd
where z is the z-score, x is the runner's raw score, µ is the mean
finishing time, and sd is the standard deviation of finishing times.
Solving first for Fred's z-score, we get
z = (x - µ) / sd = (61 - 55) / 10 = 0.6
Using the same approach to compute Wilma's z-score, we get
z = (x - µ) / sd = (51 - 55) / 10 = -0.4
Based on z-scores, we can order the runners from fastest to slowest as
follows: Wilma (z = -0.4), Barney (z = -0.3), Fred (z = 0.6), and Betty
(z = 0.7).
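The ordering of the runners can be reproduced with a few lines of Python. This is a sketch of the approach described in the solution; the variable and function names are chosen here for illustration.

```python
MEAN_TIME = 55.0  # average finishing time in minutes
SD_TIME = 10.0    # standard deviation in minutes

def z_score(x, mean=MEAN_TIME, sd=SD_TIME):
    """Standardise a raw finishing time."""
    return (x - mean) / sd

# Fred and Wilma are given as raw times; Barney and Betty as z-scores.
z_scores = {
    "Fred": z_score(61),
    "Wilma": z_score(51),
    "Barney": -0.3,
    "Betty": 0.7,
}

# A lower z-score means a shorter finishing time, so sort in ascending order.
fastest_to_slowest = sorted(z_scores, key=z_scores.get)
print(fastest_to_slowest)
```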
problem
 Each year, a national achievement test is administered to 3rd
graders. The test has a mean score of 100 and a standard
deviation of 15. If Jane's z-score is 1.20, what was her score on
the test?
(A) 82
(B) 88
(C) 100
(D) 112
(E) 118
solution
 The correct answer is (E). From the z-score equation, we
know
z = (x - µ) / sd
where z is the z-score, x is the value of Jane's test score, µ is
the mean test score, and sd is the standard deviation of test
scores.
Solving for Jane's test score (x), we get
x = (z * sd) + µ = (1.20 * 15) + 100 = 18 + 100 =
118
2. F test
 For the comparison of two variances or standard
deviations, e.g. variation in cholesterol level in men and
women
 Assumptions
 The populations from which the samples were
obtained must be normally distributed
 Samples must be independent of each other
Example problem
 Consider an experiment to study the
effect of three different levels of a factor
on a response (e.g. three levels of a
fertilizer on plant growth). If we had 6
observations for each level, we could
write the outcome of the experiment in
a table like this, where a1, a2, and a3 are
the three levels of the factor being
studied.
a1   a2   a3
6    8    13
8    12   9
4    9    11
5    11   8
3    6    7
4    8    12
solution
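Step 1: Calculate the mean within each group:
mean(a1) = (6 + 8 + 4 + 5 + 3 + 4) / 6 = 5
mean(a2) = (8 + 12 + 9 + 11 + 6 + 8) / 6 = 9
mean(a3) = (13 + 9 + 11 + 8 + 7 + 12) / 6 = 10
Step 2: Calculate the overall mean:
overall mean = (5 + 9 + 10) / a = 24 / 3 = 8
where a is the number of groups.
 Step 3: Calculate the "between-group" sum of squares:
SB = n(5 − 8)^2 + n(9 − 8)^2 + n(10 − 8)^2 = 6(9) + 6(1) + 6(4) = 84
where n is the number of data values per group.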
The between-group degrees of freedom is one less than the number of
groups
fb = 3 − 1 = 2
so the between-group mean square value is
MSB = 84 / 2 = 42
 Step 4: Calculate the "within-group" sum of squares. Begin
by centering the data in each group
a1            a2             a3
6 − 5 = 1     8 − 9 = -1     13 − 10 = 3
8 − 5 = 3     12 − 9 = 3     9 − 10 = -1
4 − 5 = -1    9 − 9 = 0      11 − 10 = 1
5 − 5 = 0     11 − 9 = 2     8 − 10 = -2
3 − 5 = -2    6 − 9 = -3     7 − 10 = -3
4 − 5 = -1    8 − 9 = -1     12 − 10 = 2
The within-group sum of squares is the sum of squares of all 18 values in this
table
SW = 1 + 9 + 1 + 0 + 4 + 1 + 1 + 9 + 0 + 4 + 9 + 1 + 9 + 1 + 1 + 4 + 9 + 4 =
68
The within-group degrees of freedom is
fW = a(n − 1) = 3(6 − 1) = 15
 Thus the within-group mean square value is
MSW = SW / fW = 68 / 15 ≈ 4.5
 Step 5: The F-ratio is
F = MSB / MSW = 42 / (68 / 15) ≈ 9.3
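Steps 1 to 5 can be reproduced with plain Python. This is a minimal sketch that assumes equal group sizes, as in the example; the variable names are chosen here for illustration.

```python
groups = {
    "a1": [6, 8, 4, 5, 3, 4],
    "a2": [8, 12, 9, 11, 6, 8],
    "a3": [13, 9, 11, 8, 7, 12],
}

a = len(groups)                       # number of groups
n = len(next(iter(groups.values())))  # observations per group (equal sizes assumed)

# Steps 1 and 2: group means and overall mean
group_means = {name: sum(vals) / n for name, vals in groups.items()}
overall_mean = sum(group_means.values()) / a

# Step 3: between-group sum of squares and mean square
ss_between = n * sum((m - overall_mean) ** 2 for m in group_means.values())
ms_between = ss_between / (a - 1)

# Step 4: within-group sum of squares and mean square
ss_within = sum((x - group_means[name]) ** 2
                for name, vals in groups.items() for x in vals)
ms_within = ss_within / (a * (n - 1))

# Step 5: F-ratio
f_ratio = ms_between / ms_within
print(f"SB = {ss_between:.0f}, SW = {ss_within:.0f}, F = {f_ratio:.2f}")
```

Without intermediate rounding the F-ratio is about 9.26. It would then be compared with the critical F value for 2 and 15 degrees of freedom at the chosen significance level.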
3. t-test
 To test the difference between two means for small
independent samples (n < 30)
 Assumptions
 Sample must be independent
 The populations are normally distributed
CORRELATION AND REGRESSION
 Correlation is a statistical method used to determine whether
a relationship between variables exists. Correlation
measures the strength of the mutual relationship between two
variables. In correlation we assume that both variables are
random and that neither depends on the other.
 Regression describes the nature of the relationship between
variables. Regression studies relationships where
dependence is involved: one variable (the dependent variable)
depends on one or more other variables. Regression can
be used to predict the values of the dependent variable
from the variables it depends upon.
Linear and Non Linear Correlation
 Linear Correlation:
Correlation is said to be linear if the ratio of change is constant.
For example, if the amount of output in a factory is doubled by doubling
the number of workers, the correlation is linear.
In other words, if all the points on the scatter
diagram tend to lie near a straight line,
the correlation is said to be linear, as shown in the figure.
 Non Linear (Curvilinear) Correlation:
Correlation is said to be non linear if the ratio of change is
not constant. In other words, if all the points
on the scatter diagram tend to lie near a smooth curve, the
correlation is said to be non linear (curvilinear), as shown in the
figure.
Positive and Negative Correlation
Positive Correlation:
Correlation in the same direction is called positive correlation:
if one variable increases, the other also increases, and if one variable
decreases, the other also decreases. For example, the length of an iron
bar increases as the temperature increases.
Negative Correlation:
Correlation in opposite directions is called negative
correlation: if one variable increases, the other decreases, and vice
versa. For example, the volume of a gas decreases as the pressure
increases, or the demand for a particular commodity increases as
its price decreases.
No Correlation or Zero Correlation:
If there is no relationship between the two variables, so that a change
in the value of one variable is not accompanied by any systematic change
in the other, there is said to be no correlation or zero correlation.
Perfect Correlation
If any change in the value of one variable changes the value of the
other variable in a fixed proportion, the correlation
between them is said to be perfect. It is indicated
numerically as +1 or -1.
 Perfect Positive Correlation:
If the values of both variables move in the same
direction in a fixed proportion, the correlation is called perfect
positive correlation. It is indicated numerically as +1.
 Perfect Negative Correlation:
If the values of both variables move in opposite
directions in a fixed proportion, the correlation is called perfect
negative correlation. It is indicated numerically as -1.
Coefficient of Correlation
For sample data, the correlation coefficient,
denoted by “r”, is a measure of the strength of the
linear relation between the X and Y variables.
“r” is a pure number and lies between -1 and +1.
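As a sketch of how r can be computed, the Python below uses the usual deviation form of the sample correlation coefficient: the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The data are hypothetical and are not taken from the original slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (an assumption for illustration): daily study and sleeping hours.
study_hours = [2, 4, 6, 8, 10]
sleep_hours = [10, 9, 8, 7, 6]

r = pearson_r(study_hours, sleep_hours)
print(f"r = {r:.2f}")
```

In this made-up data set each extra two hours of study costs exactly one hour of sleep, so the points lie on a straight line with negative slope and r equals -1.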
Examples of Correlation
 Calculate and analyze the correlation coefficient between the
number of study hours and the number of sleeping hours of
different students.
Solution:
 The necessary calculation is given below:
There is perfect negative correlation between the number
of study hours and the number of sleeping hours.
Problem
 From the following data, compute the coefficient of
correlation between X and Y:
Summation of products of deviations of X and Y
series from their arithmetic means = 122.
Solution:
LINEAR REGRESSION
 If the plot of n pairs of data (x , y) for an experiment appears to
indicate a "linear relationship" between y and x, then the method
of least squares may be used to write a linear relationship between
x and y.
 The least squares regression line for the set of n data points is given
by
y = ax + b
where a and b are given by
a = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2)
b = (1/n)(Σy - aΣx)
Example
 Consider the following set of points: {(-2 , -1) , (1 , 1) , (3 , 2)}
a) Find the least square regression line for the given data points.
b) Plot the given points and the regression line in the same
rectangular system of axes.
Solutions
a) Let us organize the data in a table.
x     y     xy    x^2
-2    -1    2     4
1     1     1     1
3     2     6     9
Σx = 2    Σy = 2    Σxy = 9    Σx^2 = 14
We now use the above formula to calculate a and b as follows
a = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2) = (3*9 - 2*2) / (3*14 - 2^2) =
23/38
b = (1/n)(Σy - aΣx) = (1/3)(2 - (23/38)*2) = 5/19
b) We now graph the regression line given by y = ax + b and
the given points.
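The least squares calculation in part a) can be checked with a short Python sketch. It uses exact fractions so that the result matches the hand calculation, and it assumes integer coordinates. The function name `least_squares_line` is introduced here for illustration.

```python
from fractions import Fraction

def least_squares_line(points):
    """Return slope a and intercept b of the least squares line y = ax + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    # Exact arithmetic; assumes integer coordinates.
    a = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

a, b = least_squares_line([(-2, -1), (1, 1), (3, 2)])
print(f"y = ({a})x + ({b})")
```

The same function can be applied to the data sets in the problems that follow, as long as the coordinates are integers.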
Problems
2 a) Find the least square regression line for the following set of data
{(-1 , 0),(0 , 2),(1 , 4),(2 , 5)}
b) Plot the given points and the regression line in the same rectangular
system of axes.
3 The values of x and their corresponding values of y are shown in the table
below
a) Find the least square regression line y = ax + b.
b) Estimate the value of y when x = 10.
4 The sales of a company (in million dollars) for each year are shown in the
table below.
a) Find the least square regression line y = ax + b.
b) Use the least squares regression line as a model to estimate the sales
of the company in 2012.
Multiple Regression
 Several independent variables and one dependent variable:
Y' = a + b1x1 + b2x2 + ... + bkxk
 Assumptions for multiple regression
 For any specific value of the independent variables, the values of the y
variable are normally distributed (normality assumption)
 The variances or standard deviations of the y variable are the same
for each value of the independent variables (equal variance
assumption)
 There is a linear relationship between the dependent variable and
the independent variables (linearity assumption)
 The independent variables are not correlated with each other
 The values of the y variable are independent
NON-PARAMETRIC TEST
 The z-, F- and t-tests are parametric: they assume that the data are
normally distributed.
 When the data are not normally distributed, a non-parametric test is
more appropriate.
 Non-parametric methods are also called distribution-free statistics.
Non-parametric methods are widely used for studying populations
that take on a ranked order (such as movie reviews receiving one
to four stars). The use of non-parametric methods may be
necessary when data have a ranking but no clear numerical
interpretation, such as when assessing preferences; in terms of
level of measurement, for data on an ordinal scale.
Advantages & Disadvantages
Advantages of Non Parametric Test
 Can be used when the variable is not normally distributed
 Can be used when the sample is small
 Can be used to test hypotheses
 The computation is easier
 Easier to understand
Disadvantages
 Less sensitive
 Less information
 Less efficient
USING MODELS
Be sure of the data requirements and the needs of the study
Consists of 4 main steps
 Model formulation
 Model optimization
 Model calibration/verification
 Model Application
Model Formulation
 Involves empirical and theoretical evidence
 Make assumptions to reduce the problem to a manageable form
(simplification of the process)
Model Optimization
 Regression analysis: the analytical way
 Subjective optimization: based on the experience of the modelers
Model Calibration
 Change the coefficients
 Reduce the error between observed and predicted values
Model Application
 After the model has been calibrated and validated