Experimental Research Methodology
– Correlation analysis –
Fernando Brito e Abreu ([email protected])
Universidade Nova de Lisboa (http://www.unl.pt)
QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)
Abstract
„ Relations between variables
„ Correlation analysis vs. experimentation
„ Sample size problem
„ Correlation
„ Parametric coefficients
„ Non-parametric coefficients
Relations between variables
„ The ultimate goal of every research or scientific analysis is finding relations between variables
… The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities
„ Either way involves relations between variables
„ Thus, the advancement of Science must always involve finding and evaluating new relations between variables
… Isn't that what correlation is about?
… Why care about experimentation, then?
Correlation analysis vs. experimentation
„ Correlation analysis
… We do not influence any variables, but only measure them and look for relations (correlations) between some set of variables
… Those correlations are quantified as coefficients whose magnitude ∈ [0%, 100%]
„ Example: practitioners' expertise and defects found
„ Experimentation
… We manipulate some variables and then measure the effects of this manipulation on other variables
„ Example: a researcher increases design complexity and then records defects found, keeping all other variables constant
… Beware of the learning effect when subjects are humans
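To make the purely observational setting concrete, here is a minimal Python sketch on hypothetical expertise/defects data; SciPy's pearsonr is used only as one possible way to quantify the relation.

```python
# Minimal sketch: correlational analysis on observed (not manipulated) data.
# The expertise/defects numbers are hypothetical illustration values.
from scipy.stats import pearsonr

expertise_years = [1, 2, 3, 5, 7, 10, 12, 15]    # measured, not controlled
defects_found   = [3, 4, 6, 9, 11, 14, 15, 19]   # measured on the same subjects

r, p_value = pearsonr(expertise_years, defects_found)
print(f"r = {r:.2f}, p = {p_value:.4f}")         # strength and significance
```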
Correlation analysis vs. experimentation
„ Experimentation can conclusively demonstrate causal relations between variables
… If we find that whenever we change variable A, variable B changes as well, then we can conclude that "A influences B"
„ Correlation analysis cannot conclusively prove causality
… We can find "high" correlation values between variables such as average literacy and expected lifetime, but there is no proven causality between them
„ Question: why then, can we observe that correlation when analyzing data from countries worldwide?
Correlation analysis vs. experimentation
„ If experimental data may potentially provide qualitatively better information than correlational data, why care about correlation analysis at all?
„ Correlation analysis only allows us to measure the association between variables, not their interdependence. Formally speaking:
1. interdependence ⇒ association
2. ¬ (association ⇒ interdependence)
Why care about correlation then?
„ Correlation analysis can be useful for:
… Reducing the size of the set of explanatory variables
„ Highly correlated ones may be measuring the same attribute
… Performing a preliminary assessment of the feasibility of an hypothesis
„ A very low correlation (association) between a dependent and an independent variable may lead us to discard the hypothesis of a causal relation
„ Most statistical tools allow us to produce cross-correlation tables (symmetrical matrices with one-by-one correlation values among the considered variables), as sketched below
… The main diagonal is obviously filled with 1's (100% correlation)
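A cross-correlation table of this kind can be produced, for instance, with pandas; the variable names and values below are hypothetical.

```python
# Sketch of a cross-correlation table; the column names and values are hypothetical.
import pandas as pd

data = pd.DataFrame({
    "size":       [10, 25, 40, 55, 80],
    "effort":     [3, 7, 11, 16, 22],
    "complexity": [2, 4, 5, 9, 12],
})

# Symmetrical matrix of pairwise correlations; the main diagonal is all 1's.
print(data.corr(method="pearson"))
```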
Association between variables: properties
„ Magnitude or size
… This property pertains to the strength of the association
… Several correlation coefficients (e.g. Pearson, Spearman) allow us to quantify this magnitude
„ Signal of the association
… Positive – when one variable increases, the other increases as well
… Negative – when one variable increases, the other decreases
„ Significance, reliability or truthfulness
… This property pertains to the representativeness, for the entire population, of the result found in our specific sample
„ It says how probable it is that a similar relation would be found if the experiment was replicated with other samples from the same population
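The three properties can be read off a coefficient and its p-value; a minimal sketch on hypothetical data follows (SciPy's pearsonr returns the signed magnitude and the significance).

```python
# Sketch: magnitude, signal and significance of an association, on hypothetical data.
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y_up   = [2, 4, 5, 8, 9, 12]     # tends to increase with x -> positive signal
y_down = [12, 9, 8, 5, 4, 2]     # tends to decrease with x -> negative signal

for label, y in [("positive", y_up), ("negative", y_down)]:
    r, p = pearsonr(x, y)        # r: signed magnitude, p: significance
    print(f"{label}: r = {r:+.2f}, p = {p:.4f}")
```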
Magnitude vs. reliability of relations
„ Usually, the larger the magnitude of the relation between variables, the more reliable the relation
… But magnitude and reliability are not totally independent!
„ Assuming that there is no relation between the respective variables in the population (null magnitude), the most likely outcome would be also finding no relation between those variables in the research sample
… Thus, the weaker the relation found in the sample (less magnitude), the more likely it is that there is no corresponding relation in the population
„ Depending on sample size, a relation of a given strength can be either highly significant or not significant at all
Sample size problem
„ The smaller the sample size, the more likely it is that we will obtain erroneous results compared to the population parameters
… The error would be to assume the existence of a relation between two variables obtained from a population in which such a relation does not exist
„ Technically speaking, the probability of a random deviation of a particular size (from the population mean) decreases with the increase in the sample size
„ Conclusion: a smaller sample size implies a smaller reliability of associations
Wrap-up
„ If the true association (in the population) between variables is:
… very small, then there is no way to identify such an association in a study, unless the research sample is correspondingly large
… very large, then it can be found to be highly significant even in a study based on a very small sample
„ Conclusion: the smaller the association between variables, the larger the sample size required to prove it significant
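A small simulation can illustrate this conclusion: for a fixed, weak population correlation (assumed here to be 0.2), significance only emerges once the sample is large enough. The values and seed are illustrative only.

```python
# Sketch: the same true (population) correlation can be non-significant in a
# small sample and highly significant in a large one. Illustrative values only.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
true_r = 0.2                              # a small association in the population
cov = [[1.0, true_r], [true_r, 1.0]]

for n in (20, 200, 2000):
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    r, p = pearsonr(x, y)
    print(f"n = {n:4d}: sample r = {r:+.2f}, p = {p:.4f}")
```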
Correlation
„ Correlation is the extent to which the values of two variables are "proportional" to each other
„ Proportional means linearly related
… Correlation is high if the relation can be approximated by a straight line
„ The line is sloped upwards or downwards, depending on the signal of the association
… That regression line, or least squares line, is so-called because it is determined such that the sum of the squared distances of all the data points from the line is the minimum
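A minimal sketch of fitting such a least squares line on hypothetical data; NumPy's polyfit minimises the sum of squared vertical distances (residuals) from the line.

```python
# Sketch: fitting the least squares (regression) line on hypothetical data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)   # minimises the sum of squared residuals
residuals = y - (slope * x + intercept)
print(f"y ~ {slope:.2f}*x + {intercept:.2f}, SSE = {np.sum(residuals**2):.3f}")
```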
Correlation coefficients
„ The magnitude of the correlation can be expressed by a correlation coefficient
… Several coefficients are proposed in the literature
„ Some are parametric and others non-parametric
„ The correlation coefficient does not depend on the specific measurement units used
… For example, the correlation between Size and Effort will be identical regardless of whether Function Points and Man.Years, or KLOC and Man.Months, are used as measurement units
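A quick sketch of this unit-invariance claim, using hypothetical size/effort figures and an arbitrary conversion factor between units.

```python
# Sketch: correlation is unchanged by a change of measurement units
# (any positive linear rescaling). Hypothetical size/effort data.
from scipy.stats import pearsonr

size_fp   = [100, 220, 310, 450, 600]          # Function Points
effort_my = [0.8, 1.9, 2.5, 3.9, 5.1]          # Man.Years

size_kloc = [s * 0.12 for s in size_fp]        # same sizes in (hypothetical) KLOC
effort_mm = [e * 12 for e in effort_my]        # same efforts in Man.Months

print(pearsonr(size_fp, effort_my)[0])         # identical r
print(pearsonr(size_kloc, effort_mm)[0])       # identical r
```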
Parametric correlation coefficients
„ The most widely-used type of correlation coefficient is Pearson r (Pearson, 1896)
… It is also called linear or product-moment correlation
„ Assumptions
… Each pair of variables is bivariate normal
… The two variables are measured on at least interval scales
„ SPSS: Analyse / Correlate / Bivariate
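For reference, a sketch of the product-moment definition on hypothetical data, checked against SciPy's implementation.

```python
# Sketch: Pearson r computed from its product-moment definition and checked
# against SciPy; the data below are hypothetical.
import math
from scipy.stats import pearsonr

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.5, 3.0, 4.5, 6.5, 8.0]

mx, my = sum(x) / len(x), sum(y) / len(y)
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
print(num / den, pearsonr(x, y)[0])   # both give the same r
```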
Nonparametric correlation coefficients
„ These statistics do not require that variables are normally distributed
„ Chi-square
… Assumptions: nominal scales
„ Spearman R, Kendall Tau, Gamma
… Assumptions: at least ordinal scales (ranks)
… For ordinal scales, if ranks are represented by literal enumerations, you have to recode them into integers
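A sketch of that recoding step, with hypothetical category names and data; SciPy's spearmanr then works on the integer codes.

```python
# Sketch: recoding a literal ordinal enumeration into integer ranks before
# computing a rank correlation. The category names and data are hypothetical.
from scipy.stats import spearmanr

severity_order = {"low": 1, "medium": 2, "high": 3, "critical": 4}

severity = ["low", "medium", "medium", "high", "critical", "high"]
fix_time_hours = [1, 3, 2, 6, 12, 7]

severity_coded = [severity_order[s] for s in severity]
rho, p = spearmanr(severity_coded, fix_time_hours)
print(f"Spearman R = {rho:.2f}, p = {p:.4f}")
```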
Spearman R correlation coefficient
„ Spearman R is similar to the Pearson coefficient, except that it can be computed from ranks
„ SPSS: Analyse / Correlate / Bivariate
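One way to see the relation to Pearson, on hypothetical data: computing Pearson's r on the ranks reproduces Spearman R (rankdata assigns average ranks to ties).

```python
# Sketch: Spearman R is Pearson's r computed on the ranks of the data.
from scipy.stats import pearsonr, rankdata, spearmanr

x = [12, 7, 30, 18, 25, 7]
y = [3, 1, 9, 4, 8, 2]

rho_direct = spearmanr(x, y)[0]
rho_via_ranks = pearsonr(rankdata(x), rankdata(y))[0]
print(rho_direct, rho_via_ranks)   # the two values coincide
```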
Kendall Tau correlation coefficient
„ Kendall tau represents a probability
… It is the difference between the probability that the observed data are in the same order for the two variables and the probability that the observed data are in different orders for the two variables
„ Kendall tau is equivalent to the Spearman R statistic with regard to the underlying assumptions
… It is also comparable in terms of its statistical power
… However, it is usually not identical in magnitude, because its underlying logic, as well as its formula, is very different
„ SPSS: Analyse / Correlate / Bivariate
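That probability interpretation can be checked by counting concordant and discordant pairs; a sketch on small, tie-free hypothetical data, compared with SciPy's kendalltau.

```python
# Sketch: Kendall tau as a difference of probabilities, on hypothetical data
# without ties (so tau-b reduces to the simple concordant/discordant form).
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

pairs = list(combinations(range(len(x)), 2))
concordant = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
discordant = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)

tau_manual = (concordant - discordant) / len(pairs)  # P(same order) - P(different order)
print(tau_manual, kendalltau(x, y)[0])               # the two values agree
```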
Gamma coefficient
„ Gamma is preferable to Spearman R or Kendall tau when the data contain many tied observations (evident in a scatter plot)
„ In terms of the underlying assumptions, Gamma is equivalent to Spearman R or Kendall tau
„ In terms of its interpretation and computation, it is more similar to Kendall tau than to Spearman R
„ Gamma is also a probability
… It is computed as the difference between the probability that the rank orderings of the two variables agree and the probability that they disagree, divided by 1 minus the probability of ties
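SciPy does not appear to provide Gamma directly, so the sketch below computes it from the equivalent form (C - D) / (C + D), where C and D are the counts of concordant and discordant pairs; pairs tied on either variable are excluded, which matches the "divided by 1 minus the probability of ties" in the definition above. The data are hypothetical.

```python
# Sketch: Goodman-Kruskal Gamma computed directly from its definition,
# gamma = (C - D) / (C + D), on hypothetical ordinal data with many ties.
from itertools import combinations

x = [1, 1, 2, 2, 2, 3, 3, 4]
y = [1, 2, 1, 2, 3, 2, 3, 3]

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1          # pairs tied on either variable are ignored

gamma = (concordant - discordant) / (concordant + discordant)
print(f"Gamma = {gamma:.2f}")
```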
Correlation significance
„ The significance of the correlation between v1 and v2 is based on the following hypotheses
„ H0: v1 and v2 are not associated
„ H1: v1 and v2 may be associated
„ Example: Parametric coefficient

  Correlations (SPSS output; only the Functional Size column is shown)

  Functional Size           Pearson Correlation    1
                            N                      3310
  Normalised Work Effort    Pearson Correlation    .459**
                            Sig. (2-tailed)        .000
                            N                      3287

„ Even for a test significance α = 0.01 (99% confidence)
„ Since p = 0.000 ≤ α
… Reject H0 and accept H1 (i.e., Size and Effort are significantly correlated)
Example: Non-Parametric coefficients
  Correlations (SPSS output; only the Functional Size column is shown)

  Kendall's tau_b   Functional Size          Correlation Coefficient   1.000
                                             Sig. (2-tailed)           .
                                             N                         3310
                    Normalised Work Effort   Correlation Coefficient   .471**
                                             Sig. (2-tailed)           .000
                                             N                         3287
  Spearman's rho    Functional Size          Correlation Coefficient   1.000
                                             Sig. (2-tailed)           .
                                             N                         3310
                    Normalised Work Effort   Correlation Coefficient   .647**
                                             Sig. (2-tailed)           .000

„ Even for a test significance α = 0.01 (99% confidence)
„ Since p = 0.000 ≤ α
… Reject H0 and accept H1 (i.e., Size and Effort are significantly correlated)