Notes 11 (revised)

GS/PPAL 6200 3.00 Section N
Research Methods and Information Systems
A QUANTITATIVE RESEARCH PROJECT
(1) DATA COLLECTION
(2) DATA DESCRIPTION
(3) DATA ANALYSIS
Correlations
• Is CGPA related in some systematic way to
total hours studied (H)?
• Remember, we need to account for the fact that
each tends to deviate randomly from its true mean.
• The “correlation coefficient” for a set of
observations is a function of how much each
of the observed values deviates from its
sample mean, adjusted for (i.e., not explained
by) random deviation
Correlations and Predictions
• Presence of a (linear) correlation may offer
predictive information that may be useful
• It may (but may not) suggest causality to be
examined further - “correlation does not imply
causation” (when there is no control group)
• It may suggest policy considerations (policy
action, spillover effects, consequences)
Representing Linear Correlation
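1. For a population of N cases, the typical notation is:
$$\rho(H, C) = \operatorname{corr}(H, C) = \frac{\operatorname{cov}(H, C)}{\sigma_H \sigma_C} = \frac{1}{N} \sum_{i=1}^{N} \frac{(H_i - \mu_H)(C_i - \mu_C)}{\sigma_H \sigma_C}$$
2. For a sample of n cases from that same population (changing
the notation to indicate the calculations are for the
sample):
$$r(H, C) = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(H_i - \bar{H})(C_i - \bar{C})}{s_H s_C}$$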
• Excel functions that calculate (2) above:
= CORREL(data array (H), data array (CGPA)), OR
= PEARSON(data array (H), data array (CGPA))
Population Correlation Coefficient
• The Pearson correlation coefficient (numbers above images)
measures only the linear relationship between two variables
"Correlation examples2" by Denis Boigelot (original uploader: Imagecreator), own work. Licensed under CC0 via Wikimedia Commons, http://commons.wikimedia.org/wiki/File:Correlation_examples2.svg#/media/File:Correlation_examples2.svg
Correlation Coefficient (= 0.816) versus
Visual Inspection of Data
"Anscombe's quartet 3" by Anscombe.svg: Schutz; derivative work (label using subscripts): Avenue (talk). Licensed under CC BY-SA 3.0 via Wikimedia Commons, http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg#/media/File:Anscombe%27s_quartet_3.svg
10-case Study
Raw Data

Case   CGPA   Total Hours Studied
1      7.67   35
2      6.83   29
3      4.17   23
4      7.67   50
5      5.00   32
6      4.17   22
7      5.00   17
8      7.33   40
9      6.83   44
10     6.33   38

[Figure: Scatter Plot with Linear Trend]
Correlation for 10-case Study
• = CORREL (CGPA, HOURS)
• = PEARSON (CGPA, HOURS)
• = 0.7944
• R-squared = 0.7944 * 0.7944 = 0.63
• If CGPA is a linear function of HOURS and CGPA is
normally distributed, then R-squared gives the
“explained variance”: 63% of the variation in
CGPA can be “explained” by variation in HOURS
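The CORREL / PEARSON result can be reproduced by hand from the sample formula. The following is a minimal sketch in Python (a language the slides do not use; the function name `pearson_r` is illustrative) that applies the formula to the 10-case raw data:

```python
import math

# 10-case study data, as listed in the raw data table.
cgpa = [7.67, 6.83, 4.17, 7.67, 5.00, 4.17, 5.00, 7.33, 6.83, 6.33]
hours = [35, 29, 23, 50, 32, 22, 17, 40, 44, 38]

def pearson_r(x, y):
    """Sample Pearson correlation: sample covariance over the product of sample SDs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

r = pearson_r(cgpa, hours)
r_squared = r ** 2
print(f"r = {r:.4f}, R-squared = {r_squared:.2f}")
```

The order of the two arguments does not matter, since the correlation is symmetric in its inputs.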
Strength versus Significance
• A “strong” correlation may or may not be
significant
• A “weak” correlation may or may not be
significant
• Key is the size of the sample – for small
samples a strong correlation may still be by
chance; for large samples it is easy to achieve
significance for weak correlations
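One way to see the role of sample size is the standard t-statistic for a correlation, t = r·sqrt(n − 2)/sqrt(1 − r²). The sketch below uses two hypothetical (r, n) pairs chosen purely for illustration (they are not from the study), and the critical values in the comments are two-sided 5% t-table entries:

```python
import math

def t_from_r(r, n):
    """t-statistic for testing H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical cases for illustration only.
strong_small = t_from_r(0.80, 4)     # df = 2, two-sided 5% critical value about 4.303
weak_large = t_from_r(0.10, 1000)    # df = 998, two-sided 5% critical value about 1.962

print(f"r = 0.80, n = 4:    t = {strong_small:.2f} (below 4.303, not significant)")
print(f"r = 0.10, n = 1000: t = {weak_large:.2f} (above 1.962, significant)")
```

A strong correlation in four cases fails the test, while a weak one in a thousand cases passes it.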
Representing Linear Relationships
• CGPA and HOURS appear to be strongly
positively correlated (though this may only be an
artifact of the small sample size) and the
correlation appears statistically significant
(despite the small sample), so examine the
relationship more closely
• General linear relationship: Y = mX + b
• for Y the dependent variable, X the independent or
explanatory variable, m the slope (coefficient on X),
and b some constant (the intercept)
Graphically
Y = mX + b
• Locate coordinates (2, 4)
that is, X = 2, Y = 4
• Locate coordinates (3, 5)
• When X increases by +1
(from 2 to 3) how much
does Y increase by? (= m)
• When X = 0, what does Y
equal? (= b)
• Therefore model is
Y = 1*X + 2
[Figure: line plot of Y Variable (y-axis) against X Variable (x-axis)]
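The same reasoning can be written as a small calculation. This is a sketch only; the helper name `line_through` is illustrative:

```python
def line_through(p1, p2):
    """Slope m and intercept b of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # change in Y per +1 change in X
    b = y1 - m * x1             # value of Y when X = 0
    return m, b

m, b = line_through((2, 4), (3, 5))
print(f"Y = {m:g}*X + {b:g}")
```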
CGPA and HOURS
• For the linear trend line,
CGPA = Intercept (b) +
coefficient (m) * HOURS
• CGPA = 2.6 +
0.106*HOURS
• For every +1 hour studied
per month, by how much
does CGPA increase?
• How did we obtain the
linear trend line?
[Figure: scatter plot of CGPA (y-axis) against Hours Studied (x-axis) with linear trend line]
Regression Analysis - Intuition
• The estimated linear trend line specifies the
linear relationship that “best fits” the data
• A “best fit” model is one that minimizes the
amount an observation deviates from the
hypothesized model
• “Best fit” here means to minimize the sum of
the squared deviations between the data
points and the linear trend line (model)
• “Linear Least Squares Regression Model”
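To make “best fit” concrete, the sketch below computes the least-squares slope and intercept for the 10-case data using the usual closed-form expressions, then checks that nudging either value increases the sum of squared deviations. The function names and the size of the nudges are illustrative assumptions:

```python
# 10-case study data, as listed in the raw data table.
cgpa = [7.67, 6.83, 4.17, 7.67, 5.00, 4.17, 5.00, 7.33, 6.83, 6.33]
hours = [35, 29, 23, 50, 32, 22, 17, 40, 44, 38]

def least_squares(x, y):
    """Closed-form slope and intercept that minimize the sum of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def sum_squared_dev(x, y, m, b):
    """Sum of squared vertical distances between the data and the line y = m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

m, b = least_squares(hours, cgpa)
best = sum_squared_dev(hours, cgpa, m, b)
worse_slope = sum_squared_dev(hours, cgpa, m + 0.01, b)
worse_intercept = sum_squared_dev(hours, cgpa, m, b + 0.5)

print(f"CGPA = {b:.2f} + {m:.4f}*HOURS")
print(f"sum of squares: best {best:.2f}, nudged slope {worse_slope:.2f}, "
      f"nudged intercept {worse_intercept:.2f}")
```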
Regression Analysis - Mechanics
• In Excel: “Data Analysis” → “Regression”
• Dependent Variable: CGPA
• Coefficients: values of “b” (intercept) and “m”
coefficient on independent (explanatory) variable
• Standard Error, t-stat, P-value and CI (95%) for
each estimate
Data Interpretation (again)
• From the Regression Output we know:
CGPA = 2.6 + 0.1058*HOURS
• For every +1 hour studied, CGPA on graduation
increases by 0.1058
• Graduating students with +1 grade point higher
than other graduating students studied, on
average, about 9.45 more hours per month
(9.45 ≈ 1 / 0.1058)
• And the 95% CI suggests the underlying (unobserved)
population value of this difference lies somewhere
between 5.8 and 25 hours per month
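The 5.8 and 25 hours-per-month bounds can be traced back to the confidence interval on the slope. The sketch below assumes the standard formula for the standard error of a simple-regression slope and uses 2.306, the two-sided 5% critical value of the t-distribution with n − 2 = 8 degrees of freedom; the variable names are illustrative:

```python
import math

# 10-case study data, as listed in the raw data table.
cgpa = [7.67, 6.83, 4.17, 7.67, 5.00, 4.17, 5.00, 7.33, 6.83, 6.33]
hours = [35, 29, 23, 50, 32, 22, 17, 40, 44, 38]

n = len(hours)
mean_h = sum(hours) / n
mean_c = sum(cgpa) / n
sxx = sum((h - mean_h) ** 2 for h in hours)
sxy = sum((h - mean_h) * (c - mean_c) for h, c in zip(hours, cgpa))

m = sxy / sxx                  # slope (coefficient on HOURS)
b = mean_c - m * mean_h        # intercept

# Standard error of the slope: sqrt(residual variance / Sxx), with n - 2 df.
sse = sum((c - (b + m * h)) ** 2 for h, c in zip(hours, cgpa))
se_m = math.sqrt(sse / (n - 2) / sxx)

T_CRIT = 2.306                 # assumed: two-sided 5% critical value, 8 df
ci_low = m - T_CRIT * se_m
ci_high = m + T_CRIT * se_m

# Convert "CGPA per hour" into "hours per +1 grade point" by taking reciprocals.
# The larger slope bound gives the smaller number of hours, and vice versa.
hours_low = 1 / ci_high
hours_high = 1 / ci_low

print(f"slope = {m:.4f}, SE = {se_m:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
print(f"hours per +1 grade point: {1 / m:.2f} (between {hours_low:.1f} and {hours_high:.1f})")
```

Taking reciprocals only makes sense here because both ends of the slope interval are positive.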
Significance
• The linear correlation between hours studied
(independent variable) and CGPA (dependent
variable) suggests a possible (causal) relationship.
• But is the relationship “significant” statistically?
Or did it occur by chance? Or is it an artifact of
the small sample size and related only to
sampling error?
• Our next question: What is the likelihood that the
relationship we observe is simply due to sampling
error or chance?
Significance Level and p-Values
• Significance Level (α): Probability of rejecting the
null hypothesis when it is true (α=1%, 5% or 10%)
• P-value: Probability of observing this event
(probability of obtaining a result equal to or more
extreme than what is actually observed) – given
that the null hypothesis is true
• P-value < α: the data are inconsistent with the
null hypothesis → reject H0
• P-value > α: the data are consistent with the null
hypothesis → cannot reject H0
P-value
• If the null hypothesis is true, what is the
probability of obtaining values equal to or more
extreme (greater or less) than what is observed in
our data?
• If the null hypothesis for our academic
performance study is that there is no relationship
between HOURS and CGPA (i.e., H0: m = 0), what
is the probability that we will observe an estimate
at least as extreme as m = 0.106?
• P-value = 0.0061, much less than the 0.05
= 5% (or 1% or 10%) level of significance, which is
the rate of falsely rejecting H0 (the rate of
committing a Type I error) → therefore reject H0
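The P-value can be approximated without Excel by integrating the t-distribution density numerically. The sketch below assumes a t-statistic of about 3.70 for the HOURS coefficient and n − 2 = 8 degrees of freedom; the function names and the use of Simpson's rule are choices made for this illustration, not the method Excel uses:

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) under H0, using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # By symmetry, the area between -|t| and +|t| is twice the area from 0 to |t|.
    return 1 - 2 * area_zero_to_t

p_value = two_sided_p(3.70, 8)
print(f"two-sided P-value = {p_value:.4f}")
```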
t-statistic
• An interval distance of +0.1 may or may not be “large”
depending on the overall variation around the average
(mean)
• The interval distance between an observed value and the
mean (or a hypothesized mean) of the variable needs to be
adjusted or standardized to account for the overall
variation
• t-statistic for the sample
• = [estimated(m) - hypothesized(m)]/SE
• which, when the null hypothesis is true, follows a
t-distribution with n-2 degrees of freedom
(approximately normal for large n)
Significance Level and t-tests
• If the null hypothesis is that m = 0, we want to
know if the estimated value of m = 0.106 is
significantly different from m = 0
• t-stat = [estimated (m) – hypothesized (m)]/SE
• = (0.106 – 0)/0.0286 = 3.7
• Is this standardized difference of 3.7 units
significantly different from 0 at the 95% confidence
level (α = 5%) for this sample size? Critical value
for the t-stat = 2.306 (see next slide)
• t-stat = 3.7 > 2.306 → difference is statistically
significant → reject H0: m = 0 → data support HA
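The decision rule above amounts to a one-line calculation followed by a comparison. A minimal sketch, with the function name `t_statistic` chosen for illustration:

```python
def t_statistic(estimate, hypothesized, standard_error):
    """Distance between estimate and hypothesized value, in standard-error units."""
    return (estimate - hypothesized) / standard_error

T_CRIT = 2.306  # two-sided 5% critical value for 8 degrees of freedom

t = t_statistic(0.106, 0.0, 0.0286)
reject_h0 = abs(t) > T_CRIT
print(f"t = {t:.1f}, reject H0: {reject_h0}")
```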
t-stat critical values
• Use Excel to calculate the critical value for
the t-stat, with DF = n - 2 = 8
• = T.INV.2T(α, DF) = T.INV.2T(0.05, 8) = 2.306
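For readers without Excel, the same critical value can be approximated by searching for the t at which the two-sided tail probability equals α. This is a sketch under stated assumptions: it integrates the t-density with Simpson's rule and uses bisection on an assumed search range of 0 to 50; it is not how T.INV.2T is implemented, and the function names are illustrative:

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_tail(t, df, steps=2000):
    """P(|T| >= t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def critical_value(alpha, df, lo=0.0, hi=50.0):
    """Bisection search for the t whose two-sided tail probability equals alpha."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_tail(mid, df) > alpha:
            lo = mid    # tail area still too large: the critical value is further out
        else:
            hi = mid
    return (lo + hi) / 2

crit = critical_value(0.05, 8)
print(f"critical t for alpha = 0.05, DF = 8: {crit:.3f}")
```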
Statistical Significance: Summary
• P-value approach: P-value = 0.0061 < .05, i.e., the
probability of obtaining a coefficient at least this
extreme purely by chance, if H0 is true, is less
than 5% → reject H0 → data support HA
(H0: coefficient on HOURS = 0; HA: ≠ 0)
• t-stat approach: t-stat = 3.7 > 2.306 → 0.106 is
statistically significantly different from 0 →
reject H0: m = 0 → data support HA: m ≠ 0
Research Conclusion
• Highly unlikely that the observed correlation occurred by
chance; data support the hypothesis that hours studying
is (positively) correlated with academic performance as
measured by CGPA at graduation
• Linear regression suggests that students with a +1 higher
CGPA at graduation studied an estimated +9.5
hours/month more than did students with a
lower CGPA
• But the small sample size means a large Confidence
Interval → population value lies somewhere between
5.8* hours/month and 25* hours/month (95% of the
time) [*take bounds on CI for m and convert to
hours/month]