GS/PPAL 6200 3.00 Section N
Research Methods and Information Systems

A QUANTITATIVE RESEARCH PROJECT
(1) DATA COLLECTION
(2) DATA DESCRIPTION
(3) DATA ANALYSIS

Correlations
• Is CGPA related in some systematic way to total hours studied (H)?
• Remember, we need to account for the fact that each variable tends to deviate randomly from its true mean.
• The “correlation coefficient” for a set of observations is a function of how much the observed values of each variable deviate from their sample means, standardized by each variable's overall variation (its standard deviation)

Correlations and Predictions
• The presence of a (linear) correlation may offer useful predictive information
• It may (but may not) suggest causality to be examined further: “correlation does not imply causation” (particularly when there is no control group)
• It may suggest policy considerations (policy action, spillover effects, consequences)

Representing Linear Correlation
1. For a population of size N, the typical notation is:
   ρ(H,C) = corr(H,C) = cov(H,C) / (σH σC) = (1/N) * Σ [(Hi − μH)(Ci − μC)] / (σH σC)
2. For a sample of size n from that same population (changing the notation to indicate the calculations are for the sample):
   r(H,C) = 1/(n−1) * Σ [(Hi − avgH)(Ci − avgC)] / (sH sC)
• Excel functions to calculate (2) above: = CORREL(data array (H), data array (CGPA)), OR = PEARSON(data array (H), data array (CGPA))

Population Correlation Coefficient
• The Pearson correlation coefficient (numbers above images) measures only the linear relationship between two variables
Figure: "Correlation examples2" by Denis Boigelot; original uploader was Imagecreator. Own work. Licensed under CC0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Correlation_examples2.svg#/media/File:Correlation_examples2.svg

Correlation Coefficient (= 0.816) versus Visual Inspection of Data
Figure: "Anscombe's quartet 3" by Anscombe.svg: Schutz; derivative work (label using subscripts): Avenue (talk).
Licensed under CC BY-SA 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg#/media/File:Anscombe%27s_quartet_3.svg

10-case Study Raw Data
Case   CGPA   Total Hours Studied
1      7.67   35
2      6.83   29
3      4.17   23
4      7.67   50
5      5.00   32
6      4.17   22
7      5.00   17
8      7.33   40
9      6.83   44
10     6.33   38
Figure: Scatter Plot with Linear Trend

Correlation for 10-case Study
• = CORREL (CGPA, HOURS)
• = PEARSON (CGPA, HOURS)
• = 0.7944
• R-squared = 0.7944 * 0.7944 = 0.63
• If CGPA is a linear function of HOURS and CGPA is normally distributed, then R-squared gives the “explained variance”: 63% of the variation in CGPA can be “explained” by variation in HOURS

Strength versus Significance
• A “strong” correlation may or may not be significant
• A “weak” correlation may or may not be significant
• The key is the size of the sample: for small samples a strong correlation may still arise by chance; for large samples it is easy to achieve significance even for weak correlations

Representing Linear Relationships
• Since CGPA and HOURS appear to be strongly positively correlated (although this may only be an artifact of the small sample size) and the correlation is statistically significant (despite the small sample), we examine the relationship more closely
• General linear relationship: Y = mX + b
• where Y is the dependent variable, X the independent or explanatory variable, m the slope coefficient, and b a constant (the intercept)

Graphically
Figure: plot of the line Y = mX + b, with X Variable on the horizontal axis and Y Variable on the vertical axis
• Locate coordinates (2, 4), that is, X = 2, Y = 4
• Locate coordinates (3, 5)
• When X increases by +1 (from 2 to 3), how much does Y increase by? (= m)
• When X = 0, what does Y equal? (= b)
• Therefore the model is Y = 1*X + 2

CGPA and HOURS
• For the linear trend line, CGPA = Intercept (b) + coefficient (m) * HOURS
• CGPA = 2.6 + 0.106*HOURS
• For every +1 hour studied per month, by how much does CGPA increase?
• How did we obtain the linear trend line?
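
The correlation reported for the 10-case study can be checked outside Excel. The following is a minimal sketch in plain Python (standard library only) that applies the sample formula for r to the raw data; the function name pearson_r is illustrative and does not come from the slides.

```python
import math

# Raw data from the 10-case study (one entry per case, in case order)
cgpa = [7.67, 6.83, 4.17, 7.67, 5.00, 4.17, 5.00, 7.33, 6.83, 6.33]
hours = [35, 29, 23, 50, 32, 22, 17, 40, 44, 38]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the sample means.
    # The 1/(n-1) factors in the covariance and in the two standard
    # deviations cancel, so they are omitted here.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(hours, cgpa)
print(f"r = {r:.4f}")              # compare with CORREL(CGPA, HOURS)
print(f"R-squared = {r * r:.2f}")  # share of variation "explained"
```

Because the correlation is symmetric, passing the two lists in either order gives the same r, which is why CORREL (CGPA, HOURS) and CORREL (HOURS, CGPA) agree.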
Figure: CGPA (vertical axis) plotted against Hours Studied (horizontal axis), with the linear trend line

Regression Analysis - Intuition
• The estimated linear trend line specifies the linear relationship that “best fits” the data
• A “best fit” model is one that minimizes the amount by which the observations deviate from the hypothesized model
• “Best fit” here means minimizing the sum of the squared deviations between the data points and the linear trend line (model)
• “Linear Least Squares Regression Model”

Regression Analysis - Mechanics
• In Excel: “Data Analysis” → “Regression”
• Dependent Variable: CGPA
• Coefficients: values of “b” (intercept) and “m” (coefficient on the independent, or explanatory, variable)
• Standard Error, t-stat, P-value and CI (95%) for each estimate

Data Interpretation (again)
• From the Regression Output we know: CGPA = 2.6 + 0.1058*HOURS
• For every +1 hour studied per month, predicted CGPA on graduation increases by 0.1058
• Graduating students with a CGPA +1 grade point higher than other graduating students studied, on average, about 9.4 more hours per month (1 / 0.106 ≈ 9.43)
• The 95% CI for the coefficient suggests that the underlying (unobserved) population value lies somewhere between 5.8 and 25 hours per month

Significance
• The linear correlation between hours studied (independent variable) and CGPA (dependent variable) suggests a possible (causal) relationship.
• But is the relationship statistically “significant”? Or did it occur by chance? Or is it an artifact of the small sample size, related only to sampling error?
• Our next question: What is the likelihood that the relationship we observe is simply due to sampling error or chance?
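
The trend line quoted in the regression slides above can be reproduced from the raw 10-case data. The sketch below, in plain Python, uses the standard closed-form least-squares solution for one explanatory variable; it is not Excel's own routine, and the helper names least_squares and sse are illustrative. It also nudges the fitted line to show that the sum of squared deviations only gets larger.

```python
# Raw data from the 10-case study (one entry per case, in case order)
cgpa = [7.67, 6.83, 4.17, 7.67, 5.00, 4.17, 5.00, 7.33, 6.83, 6.33]
hours = [35, 29, 23, 50, 32, 22, 17, 40, 44, 38]


def least_squares(x, y):
    """Slope m and intercept b that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    m = sxy / sxx
    b = mean_y - m * mean_x  # the fitted line passes through the two means
    return m, b


def sse(m, b, x, y):
    """Sum of squared vertical deviations of the data from y = m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))


m, b = least_squares(hours, cgpa)
print(f"CGPA = {b:.2f} + {m:.4f} * HOURS")
print(f"SSE at the fitted line:   {sse(m, b, hours, cgpa):.3f}")
# Moving the slope or the intercept away from the fit increases the SSE
print(f"SSE with slope + 0.01:    {sse(m + 0.01, b, hours, cgpa):.3f}")
print(f"SSE with intercept + 0.5: {sse(m, b + 0.5, hours, cgpa):.3f}")
# Reciprocal of the slope: extra hours per month per +1 grade point
print(f"Hours per +1 grade point: {1 / m:.2f}")
```

The last line is the conversion used in the interpretation slide: the slope is in grade points per hour, so its reciprocal is in hours per grade point.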
Significance Level and p-Values
• Significance Level (α): probability of rejecting the null hypothesis when it is true (α = 1%, 5% or 10%)
• P-value: probability of obtaining a result equal to or more extreme than what is actually observed, given that the null hypothesis is true
• P-value < α: the data are inconsistent with the null hypothesis → reject H0
• P-value > α: the data are consistent with the null hypothesis → cannot reject H0

P-value
• If the null hypothesis is true, what is the probability of obtaining values equal to or more extreme (greater or less) than what is observed in our data?
• If the null hypothesis for our academic performance study is that there is no relationship between HOURS and CGPA (i.e., H0: m = 0), what is the probability that we will observe an estimate of m at least as far from 0 as 0.106?
• P-value = 0.0061, much less than the 0.05 = 5% level of significance (and less than 1% or 10%). The level of significance is the rate of falsely rejecting H0, that is, the rate of committing a Type I error → therefore reject H0

t-statistic
• An interval distance of +0.1 may or may not be “large” depending on the overall variation around the average (mean)
• The interval distance between an observed value and the mean (or a hypothesized mean) of the variable needs to be adjusted or standardized to account for the overall variation
• t-statistic for the sample
• = [estimated(m) − hypothesized(m)]/SE
• which follows a Student's t-distribution with n−2 degrees of freedom (approximately normal when the sample is large)

Significance Level and t-tests
• If the null hypothesis is that m = 0, we want to know if the estimated value of m = 0.106 is significantly different from m = 0
• t-stat = [estimated(m) − hypothesized(m)]/SE
• = (0.106 − 0)/0.0286 = 3.7
• Is this standardized difference of 3.7 units significantly different from 0 at the 95% confidence level (α = 5%) for this sample size?
• Critical value for the t-stat = 2.306 (see next slide)
• t-stat = 3.7 > 2.306 → the difference is statistically significant → reject H0: m = 0 → data support HA

t-stat critical values
• Use Excel to calculate the two-tailed critical value for α = 0.05 and DF = n − 2 = 8
• = T.INV.2T(α, DF) = T.INV.2T(0.05, 8) = 2.306

Statistical Significance: Summary
• P-value approach: P-value = 0.0061 < 0.05, i.e., if H0 were true, the probability of obtaining a coefficient at least this extreme purely by chance is less than 5% → reject H0 → data support HA (H0: coefficient on HOURS = 0; HA: coefficient on HOURS ≠ 0)
• t-stat approach: t-stat = 3.7 > 2.306 → 0.106 is statistically significantly different from 0 → reject H0: m = 0 → data support HA: m ≠ 0

Research Conclusion
• It is highly unlikely that the observed correlation occurred by chance; the data support the hypothesis that hours spent studying are (positively) correlated with academic performance as measured by CGPA at graduation
• Linear regression suggests that students with a CGPA +1 grade point higher at graduation studied an estimated 9.4 more hours per month (1 / 0.106) than students with the lower CGPA
• But the small sample size means a wide Confidence Interval → the population value lies somewhere between 5.8* hours/month and 25* hours/month (with 95% confidence) [*take the bounds of the CI for m and convert to hours/month by taking reciprocals]
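
The significance results summarized above (t-stat, critical value, P-value, and the 95% CI converted to hours/month) can be checked from the raw 10-case data. The sketch below is plain Python with no external libraries. As an assumption made only to keep it self-contained, it evaluates the t-distribution by numerical integration (Simpson's rule) and finds the critical value by bisection, as stand-ins for Excel's T.DIST.2T and T.INV.2T. The helper names t_pdf, two_sided_p and t_critical are hypothetical.

```python
import math

# Raw data from the 10-case study (one entry per case, in case order)
cgpa = [7.67, 6.83, 4.17, 7.67, 5.00, 4.17, 5.00, 7.33, 6.83, 6.33]
hours = [35, 29, 23, 50, 32, 22, 17, 40, 44, 38]
n = len(hours)
df = n - 2  # degrees of freedom for a one-variable regression

# Least-squares slope (m) and intercept (b)
mean_h = sum(hours) / n
mean_c = sum(cgpa) / n
sxx = sum((h - mean_h) ** 2 for h in hours)
sxy = sum((h - mean_h) * (c - mean_c) for h, c in zip(hours, cgpa))
m = sxy / sxx
b = mean_c - m * mean_h

# Standard error of the slope: sqrt(residual variance / Sxx)
resid_ss = sum((c - (b + m * h)) ** 2 for h, c in zip(hours, cgpa))
se_m = math.sqrt(resid_ss / df / sxx)
t_stat = (m - 0) / se_m  # hypothesized m = 0 under H0


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), integrating the density over [0, |t|] by Simpson's rule."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


def t_critical(alpha, df):
    """Bisection for the t whose two-sided tail probability equals alpha."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


p_value = two_sided_p(t_stat, df)
crit = t_critical(0.05, df)
ci_low, ci_high = m - crit * se_m, m + crit * se_m

print(f"SE(m) = {se_m:.4f}, t-stat = {t_stat:.2f}")
print(f"critical value = {crit:.3f}, P-value = {p_value:.4f}")
print(f"95% CI for m: {ci_low:.4f} to {ci_high:.4f}")
# Reciprocals of the CI bounds give hours/month per +1 grade point
print(f"hours per +1 grade point: {1 / ci_high:.1f} to {1 / ci_low:.1f}")
```

Note that the reciprocal reverses the order of the bounds: the upper bound on m gives the lower bound on hours per grade point, and vice versa.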