Experimental Research Methodology
Correlation analysis

Fernando Brito e Abreu ([email protected])
Universidade Nova de Lisboa (http://www.unl.pt)
QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)

Abstract
- Correlation analysis vs. experimentation
- Relations between variables
- Sample size problem
- Correlation
- Parametric coefficients
- Non-parametric coefficients

Relations between variables
- The ultimate goal of every research or scientific analysis is finding relations between variables
- The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities
- Either way involves relations between variables
- Thus, the advancement of Science must always involve finding and evaluating new relations between variables
- Isn't that what correlation is about? Why care about experimentation, then?

Correlation analysis vs. experimentation
- Correlation analysis
  - We do not influence any variables but only measure them and look for relations (correlations) between some set of variables
  - Those correlations are quantified as coefficients whose magnitude lies in [0%, 100%]
  - Example: practitioners' expertise and defects found
- Experimentation
  - We manipulate some variables and then measure the effects of this manipulation on other variables
  - Example: a researcher increases design complexity and then records defects found, keeping all other variables constant
  - Beware of the learning effect when subjects are humans

Correlation analysis vs. experimentation
- Experimentation can conclusively demonstrate causal relations between variables
  - If we find that whenever we change variable A, then variable B changes, then we can conclude that "A influences B"
- Correlation analysis cannot conclusively prove causality
  - We can find "high" correlation values between variables such as average literacy and expected lifetime, but there is no proven causality between them
  - Question: why, then, can we observe that correlation when analyzing data from countries worldwide?

Correlation analysis vs. experimentation
- If experimental data may potentially provide qualitatively better information than correlational data, why care about correlation analysis at all?
- Correlation analysis only allows us to measure the association between variables, not their interdependence. Formally speaking:
  1. interdependence ⇒ association
  2. ¬ (association ⇒ interdependence)

Why care about correlation then?
- Correlation analysis can be useful for:
  - Reducing the size of a set of explanatory variables
    - Highly correlated ones may be measuring the same attribute
  - Performing a preliminary assessment of the feasibility of a hypothesis
    - A very low correlation (association) between a dependent and an independent variable may lead us to discard the hypothesis of causality
- Most statistical tools allow us to produce cross-correlation tables (symmetrical matrices with pairwise correlation values among the considered variables)
  - The main diagonal is filled with 1's (100% correlation), since each variable is perfectly correlated with itself

Association between variables: properties
- Magnitude or size
  - This property pertains to the strength of the association
  - Several correlation coefficients (e.g. Pearson, Spearman) allow us to quantify this magnitude
- Sign of the association
  - Positive: when a variable increases, the other increases as well
  - Negative: when a variable increases, the other decreases
- Significance, reliability or truthfulness
  - This property pertains to the representativeness of the result found in our specific sample for the entire population
  - It says how probable it is that a similar relation would be found if the experiment was replicated with other samples from the same population

Magnitude vs. reliability of relations
- Magnitude and reliability are different properties, but they are not totally independent
- Usually, for a sample of a given size, the larger the magnitude of the relation between variables, the more reliable the relation
  - Assuming that there is no relation between the respective variables in the population (null magnitude), the most likely outcome would be also finding no relation between those variables in the research sample
  - Thus, the stronger the relation found in the sample (greater magnitude), the less likely it is that there is no corresponding relation in the population
- Depending on sample size, a relation of a given strength can be either highly significant or not significant at all

Sample size problem
- The smaller the sample size, the more likely it is that we will obtain erroneous results when comparing to the population parameters
  - The error would be to assume the existence of a relation between two variables obtained from a population in which such a relation does not exist
- Technically speaking, the probability of a random deviation of a particular size (from the population mean) decreases with the increase in the sample size
- Conclusion: a smaller sample size implies a smaller reliability of associations

Wrap-up
- If the true association (in the population) between variables is:
  - small, then there is no way to identify such an association in a study, unless the research sample is correspondingly large
  - very large, then it can be found to be highly significant even in a study based on a very small sample
- Conclusion: the smaller the association between variables, the larger the sample size required to prove it significant

Correlation
- Correlation is the extent to which values of two variables are "proportional" to each other
  - Proportional means linearly related
- Correlation is high if the relation can be approximated by a straight line
  - The line is sloped upwards or downwards, depending on the sign of the association
  - That regression line or least squares line is so called because it is determined such that the sum of the squared distances of all the data points from the line is the minimum

Correlation coefficients
- The magnitude of the correlation can be expressed by a correlation coefficient
- Several coefficients are proposed in the literature
  - Some are parametric and others non-parametric
- The correlation coefficient does not depend on the specific measurement units used
  - For example, the correlation between Size and Effort will be identical regardless of whether Function Points and Man.Years, or KLOC and Man.Months, are used as measurement units

Parametric correlation coefficients
- The most widely used type of correlation coefficient is Pearson r (Pearson, 1896)
  - It is also called linear or product-moment correlation
- Assumptions
  - Each pair of variables is bivariate normal
  - The two variables are measured on at least interval scales
- SPSS: Analyse / Correlate / Bivariate

Nonparametric correlation coefficients
- These statistics do not require that variables are normally distributed
- Chi-square
  - Assumptions: nominal scales
- Spearman R, Kendall Tau, Gamma
  - Assumptions: at least ordinal scales (ranks)
- For ordinal scales, if ranks are represented by literal enumerations, you have to recode them into integers

Spearman R correlation coefficient
- Spearman R is similar to the Pearson coefficient, except that it is computed from ranks
- SPSS: Analyse / Correlate / Bivariate

Kendall Tau correlation coefficient
- Kendall tau represents a probability
  - It is the difference between the probability that the observed data are in the same order for the two variables and the probability that the observed data are in different orders for the two variables
- Kendall tau is equivalent to the Spearman R statistic with regard to the underlying assumptions
- It is also comparable in terms of its statistical power
- However, it is usually not identical in magnitude because its underlying logic, as well as its formula, is very different
- SPSS: Analyse / Correlate / Bivariate

Gamma coefficient
- Gamma is preferable to Spearman R or Kendall tau when the data contain many tied observations (evident in a scatter plot)
- In terms of the underlying assumptions, Gamma is equivalent to Spearman R or Kendall tau
- In terms of its interpretation and computation, it is more similar to Kendall tau than to Spearman R
- Gamma is also a probability
  - It is computed as the difference between the probability that the rank orderings of the two variables agree and the probability that they disagree, divided by 1 minus the probability of ties

Correlation significance
- The significance of the correlation between v1 and v2 is based on the following hypotheses
  - H0: v1 and v2 are not associated
  - H1: v1 and v2 may be associated

Example: Parametric coefficient

Correlations (only the Functional Size column is legible in the source)

                                               Functional Size
  Functional Size          Pearson Correlation        1
                           Sig. (2-tailed)
                           N                       3310
  Normalised Work Effort   Pearson Correlation     .459**
                           Sig. (2-tailed)         .000
                           N                       3287

- Even for a significance level α = 0.01 (99% confidence):
  - Since p = 0.000 ≤ α (SPSS rounds p to three decimals, so p < 0.0005), reject H0 and accept H1
  - (We cannot reject the hypothesis that Size and Effort are correlated)

Example: Non-Parametric coefficients

Correlations (only the Functional Size column is legible in the source)

                                                                    Functional Size
  Kendall's tau_b   Functional Size          Correlation Coefficient    1.000
                                             Sig. (2-tailed)             .
                                             N                          3310
                    Normalised Work Effort   Correlation Coefficient     .471**
                                             Sig. (2-tailed)             .000
                                             N                          3287
  Spearman's rho    Functional Size          Correlation Coefficient    1.000
                                             Sig. (2-tailed)             .
                                             N                          3310
                    Normalised Work Effort   Correlation Coefficient     .647**
                                             Sig. (2-tailed)             .000

- Even for a significance level α = 0.01 (99% confidence):
  - Since p = 0.000 ≤ α (SPSS rounds p to three decimals, so p < 0.0005), reject H0 and accept H1
  - (We cannot reject the hypothesis that Size and Effort are correlated)
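
The coefficients discussed in these slides can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure SPSS uses: the `size` and `effort` values are made-up data (they are not the dataset behind the tables above), all function names are hypothetical, Kendall's tau is computed in its tau-b form, and significance is estimated with a permutation test, which is a substitute technique chosen here because it needs no distribution tables.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman R: the Pearson coefficient computed on ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def pair_counts(x, y):
    """Count concordant, discordant, x-only tied and y-only tied pairs."""
    conc = disc = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        if x1 == x2 and y1 == y2:
            continue  # tied on both variables: ignored by tau-b and gamma
        if x1 == x2:
            tied_x += 1
        elif y1 == y2:
            tied_y += 1
        elif (x1 - x2) * (y1 - y2) > 0:
            conc += 1
        else:
            disc += 1
    return conc, disc, tied_x, tied_y


def kendall_tau_b(x, y):
    conc, disc, tied_x, tied_y = pair_counts(x, y)
    return (conc - disc) / math.sqrt((conc + disc + tied_x) * (conc + disc + tied_y))


def goodman_kruskal_gamma(x, y):
    conc, disc, _, _ = pair_counts(x, y)
    return (conc - disc) / (conc + disc)  # tied pairs are left out entirely


def permutation_p_value(x, y, statistic, n_permutations=2000, seed=0):
    """Two-sided p-value for H0 (no association), estimated by shuffling y."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(statistic(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: functional size and work effort for eight projects.
size = [10, 20, 30, 40, 50, 60, 70, 80]
effort = [12, 25, 25, 48, 52, 70, 68, 90]  # one tie and one inversion

r = pearson_r(size, effort)
rho = spearman_r(size, effort)
tau_b = kendall_tau_b(size, effort)
gamma = goodman_kruskal_gamma(size, effort)
p = permutation_p_value(size, effort, pearson_r)

print(f"Pearson r = {r:.3f}, Spearman R = {rho:.3f}")
print(f"Kendall tau-b = {tau_b:.3f}, Gamma = {gamma:.3f}")
print("reject H0 at alpha = 0.01" if p <= 0.01 else "cannot reject H0")
```

On this sample the four coefficients agree in sign but differ in magnitude, and Gamma exceeds tau-b because the tied pair is dropped from its denominator. Rescaling `size` (for instance, multiplying every value by 1000) leaves all four coefficients unchanged, which matches the unit-independence remark in the slides.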