Basic Statistics for SGPE Students
Part I: Descriptive Statistics
Achim Ahrens ([email protected])
Anna Babloyan ([email protected])
Erkal Ersoy ([email protected])
Heriot-Watt University, Edinburgh
September 2015
Outline
1. Descriptive statistics
   - Sample statistics (mean, variance, percentiles)
   - Graphs (box plot, histogram)
   - Data transformations (log transformation, unit of measure)
   - Correlation vs. Causation
2. Probability theory
   - Conditional probabilities and independence
   - Bayes' theorem
3. Probability distributions
   - Discrete and continuous probability functions
   - Probability density function & cumulative distribution function
   - Binomial, Poisson and Normal distribution
   - E[X] and V[X]
4. Statistical inference
   - Population vs. sample
   - Law of large numbers
   - Central limit theorem
   - Confidence intervals
   - Hypothesis testing and p-values
Descriptive statistics
- In recent years, more and better-quality data have been recorded than at any other time in history.
- The increasing size of data sets that are readily available to us has enabled us to adopt new and more robust statistical tools.
- Rising data availability has (unfortunately) led empirical researchers to sometimes overlook preliminary steps, such as summarizing and visually examining their data sets.
- Ignoring these preliminary steps can lead to important issues and invalidate seemingly significant results.

As we will see in this and the following lectures, there are several ways in which we can numerically summarize a data set. Before we discuss those approaches, let's take a quick look at what's available to us for visualizing a data set graphically.
Descriptive statistics
Histograms
- Histograms are extremely useful for getting a good graphical representation of the distribution of data. These figures consist of adjacent rectangles over discrete intervals, whose areas equal the frequency of observations in each interval.
- Histograms are often normalized to show the proportion (or density) of observations that fall into non-overlapping categories. In such cases, the total area under the bins equals 1.

Remark
The height of each bin in a normalized histogram represents the density or proportion of observations that fall into that category. These can more easily be interpreted as percentages.
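As an illustration (not from the original slides), here is a minimal Python sketch of a normalized histogram; the life-expectancy values are made-up placeholders, and density=True is what makes the total bin area equal 1.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder life-expectancy values, not the lecture's data
rng = np.random.default_rng(0)
life_exp = rng.normal(loc=65, scale=10, size=200)

# density=True normalizes the histogram so the bin areas sum to 1
plt.hist(life_exp, bins=12, density=True, edgecolor="black")
plt.xlabel("Life expectancy (in years)")
plt.ylabel("Density")
plt.show()
```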
Descriptive statistics
Histograms

[Figure: Histogram of life expectancy across countries, 1960. Horizontal axis: life expectancy (in years), from 25 to 85; vertical axis: density, from 0 to .08.]

- Approximately, what is the average life expectancy in 1960?
- Roughly what percentage of countries had life expectancy above 65?
- What proportion of countries had a life expectancy less than 55 years?
Descriptive statistics
Histograms

[Figures: Histograms of life expectancy across countries in 1960, 1990, and 2011, drawn on the same axes (life expectancy in years, 25 to 85; density, 0 to .08).]
Descriptive statistics
The Mean and Standard Deviation
- A histogram can help summarize large amounts of data, but we often like to see an even shorter (and sometimes easier to interpret) summary. This is usually provided by the mean and the standard deviation.
- The mean (and median) are frequently used to find the center, whereas the standard deviation measures the spread.

Definition
The (arithmetic) mean of a list of numbers is their sum divided by how many there are.

For example, the mean of 9, 1, 2, 2, 0 is (9 + 1 + 2 + 2 + 0)/5 = 2.8.

More generally, mean = x̄ = (1/n) Σᵢ xᵢ, where the sum runs over i = 1, ..., n.
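A one-line check of this example in Python (purely illustrative):

```python
xs = [9, 1, 2, 2, 0]
mean = sum(xs) / len(xs)  # the sum divided by the number of entries
print(mean)  # 2.8
```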
Descriptive statistics
The Mean and Standard Deviation
- The standard deviation (SD) tells us how far numbers on a list deviate from their average. Usually, most numbers are within one SD around the mean.
- More specifically, for normally distributed variables, about 68% of entries are within one SD of the mean and about 95% of entries are within two SDs.

[Figure: A normal curve with the central 68% region spanning mean − one SD to mean + one SD, and the central 95% region spanning mean − two SDs to mean + two SDs.]
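A quick simulation (an illustrative sketch, not part of the slides) confirms these coverage figures for normal draws:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)

# Fraction of draws within one and two SDs of the mean
within_1sd = np.mean(np.abs(x - x.mean()) <= x.std())
within_2sd = np.mean(np.abs(x - x.mean()) <= 2 * x.std())
print(within_1sd, within_2sd)  # roughly 0.68 and 0.95
```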
Descriptive statistics
Computing the Standard Deviation

Definition
Standard Deviation = √(mean of (deviations from the mean)²),
where deviation from mean = entry − mean.

In formal notation, σ = √[(1/N) Σᵢ (xᵢ − µ)²], where µ = (1/N)(x₁ + ... + xN).

Example: Find the SD of 20, 10, 15, 15.
Answer: mean = x̄ = (20 + 10 + 15 + 15)/4 = 15. Then the deviations are 5, −5, 0, 0, respectively.
So, SD = √[(5² + (−5)² + 0² + 0²)/4] = √(50/4) = √12.5 ≈ 3.5.

Remark
The SD comes out in the same units as the data. For example, if the data are a set of individuals' heights in inches, the SD is in inches too.
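The same computation in Python (a sketch; note the division by N, i.e. the population SD, matching the formula above):

```python
import math

xs = [20, 10, 15, 15]
mean = sum(xs) / len(xs)
# Divide the squared deviations by N, not N - 1, as in the slide's formula
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
print(sd)  # ≈ 3.54
```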
Descriptive statistics
The Root-Mean-Square

Consider the following list of numbers: 0, 5, −8, 7, −3.
Question: How big are these numbers? What is their mean?

The mean is 0.2, but this does not tell us much about the size of the numbers; it only implies that the positive numbers slightly outweigh the negative ones.
To get a better sense of their size, we could use the mean of their absolute values. Statisticians tend to use another measure, though: the root-mean-square.

Definition
Root-mean-square (rms) = √(average of (entries)²)
Descriptive statistics
The Root-Mean-Square and Standard Deviation

There is an alternative way of calculating the SD using the root-mean-square:

Remark
SD = √[mean of (entries)² − (mean of entries)²]

Recall the four numbers we used earlier to calculate the SD: 20, 10, 15, 15.
mean of (entries)² = (20² + 10² + 15² + 15²)/4 = 950/4 = 237.5
(mean of entries)² = ((20 + 10 + 15 + 15)/4)² = (60/4)² = 225
Therefore, SD = √(237.5 − 225) = √12.5 ≈ 3.5, which agrees with what we found earlier.
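Again as an illustrative check in Python, both routes give the same answer:

```python
import math

xs = [20, 10, 15, 15]
mean_of_squares = sum(x ** 2 for x in xs) / len(xs)  # 237.5
square_of_mean = (sum(xs) / len(xs)) ** 2            # 225.0
sd = math.sqrt(mean_of_squares - square_of_mean)
print(sd)  # ≈ 3.54, same as the deviations-based formula
```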
Descriptive statistics
Variance
- In probability theory and statistics, variance gets mentioned nearly as often as the mean and standard deviation. It is very closely related to the SD and is a measure of how far a set of numbers lies from their mean.
- Variance is the second moment of a distribution (the mean being the first moment), and therefore tells us about the properties of the distribution (more on these later).

Definition
Variance = (Std. Dev.)² = σ²
Descriptive statistics
Normal Approximation for Data and Percentiles

[Figure: Histogram of trading volume (in thousands, from 5,000 to 25,000; frequency from 0 to 60) for the S&P 500, January 2001 to December 2001, with markers from −2 SD to +4 SD around the mean. Source: Yahoo! Finance and Commodity Systems, Inc.]

Is the normal approximation satisfactory here?
Descriptive statistics
Normal Approximation for Data and Percentiles

[Figure: Histogram of life expectancy in 2011 (in years, from 45 to 95; density from 0 to .08) with a fitted normal curve and markers from −2 SD to +2 SD around the mean.]

How about here?
Descriptive statistics
Normal Approximation for Data and Percentiles

Remark
The mean and SD can be used to effectively summarize data that follow the normal curve, but these summary statistics can be much less satisfactory for data that do not follow the normal curve.
In such cases, statisticians often opt for using percentiles to summarize distributions.

Table. Selected percentiles for life expectancy in 2011

Percentile   Value
1            48
10           52.6
25           63
50           73.4
75           76.9
95           81.8
99           82.7
Descriptive statistics
Calculating percentiles
1. Order all the values in your data set in ascending order (i.e. smallest to largest).
2. Select the percentile, P (expressed as a fraction), that you would like to calculate and multiply it by the total number of entries in your data set, n. The value you obtain here is called the index.
3. If the index is not a whole number, round it up to the next integer.
4. Count the entries in your list of numbers, starting from the smallest one, until you get to the position indicated by your index.
5. This entry is the Pth percentile of your data set.
(A code sketch of this recipe follows the worked example below.)
Descriptive statistics
Calculating percentiles

Example
Consider the following list of 5 numbers: 10, 15, 20, 25, 30. What is the entry that corresponds to the 25th percentile? What is the median?

To obtain the 25th percentile, we first compute the index: 0.25 × 5 = 1.25. Rounding this up to the next integer gives 2, so the 25th percentile in this case is the second entry, 15.

We were also asked to obtain the median. To do this, calculate 0.5 × 5 = 2.5. Rounding this up gives 3, so the median in this case is the third entry, 20.
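Here is a direct transcription of the five-step recipe in Python (an illustrative sketch; note that library routines such as numpy.percentile interpolate by default, so they can return slightly different values):

```python
import math

def nearest_rank_percentile(values, p):
    """p is a fraction, e.g. 0.25 for the 25th percentile."""
    ordered = sorted(values)          # step 1: ascending order
    index = p * len(ordered)          # step 2: the index
    index = math.ceil(index)          # step 3: round up if fractional
    return ordered[index - 1]         # steps 4-5: count from the smallest

data = [10, 15, 20, 25, 30]
print(nearest_rank_percentile(data, 0.25))  # 15
print(nearest_rank_percentile(data, 0.50))  # 20 (the median)
```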
Descriptive statistics
Percentiles

The 1st percentile of the distribution is approximately 48, meaning that life expectancy in 1% of countries in 2011 was 48 or less, and 99% of countries had a life expectancy higher than that.
Similarly, the fact that the 25th percentile is 63 implies that 25% of countries had a life expectancy of 63 or less, whereas 75% had a longer expected lifespan.

Definition
The interquartile range is defined as 75th percentile − 25th percentile and is sometimes used as a measure of spread, particularly when the SD would pay too much (or too little) attention to a small percentage of cases in the tails of the distribution.

From the table above, the interquartile range equals 76.9 − 63 = 13.9 (and the SD was 10.14).
Descriptive statistics
Box plots

The structure of a box plot:
- Upper adjacent line (upper adjacent value): the largest value within 1.5 × IQR above the 75th percentile.
- Box: bounded by the 75th percentile/3rd quartile (upper hinge) above and the 25th percentile/1st quartile (lower hinge) below, with the 50th percentile (median) marked inside.
- Whiskers: connect the hinges of the box to the adjacent lines.
- Lower adjacent line (lower adjacent value): the smallest value within 1.5 × IQR below the 25th percentile.
- Entries beyond the adjacent values are plotted individually.
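To see these pieces in practice, a minimal matplotlib sketch (illustrative placeholder data; whis=1.5 sets the 1.5 × IQR whisker rule described above, which is also matplotlib's default):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
sample = rng.normal(loc=70, scale=8, size=150)  # placeholder data

# whis=1.5 puts the adjacent lines at most 1.5 * IQR beyond the hinges;
# points beyond them are drawn individually as outliers
plt.boxplot(sample, whis=1.5)
plt.ylabel("Life expectancy (in years)")
plt.show()
```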
Descriptive statistics
Box plots

Are there any clear patterns emerging from summarizing the data this way?

[Figure: Box plots of life expectancy (in years, 50 to 90) by region in 2011, one box per region: EAS, ECS, LCN, MEA, NAC, SAS, SSF.]

Legend
EAS: East Asia & Pacific
ECS: Europe & Central Asia
LCN: Latin America & Caribbean
MEA: Middle East & North Africa
NAC: North America
SAS: South Asia
SSF: Sub-Saharan Africa
Descriptive statistics
Box plots

We might be able to spot some patterns that developed over time if we look at different years:

[Figures: Box plots of life expectancy by region (EAS, ECS, LCN, MEA, NAC, SAS, SSF; in years, 25 to 85) for 1960, 1990, and 2011.]
Data Transformations
The effects of changing the unit of measure
- Now that we know how to summarize a dataset, let us turn to investigating the effects that changing the unit of measure of a variable has on the mean and standard deviation.
- Such changes in the unit of measure could be for practical reasons or based on theory, but regardless of the reason, a statistician should know what to expect.
- To study this, let's consider a dataset on 200 individuals' weights and heights.
- Each entry is originally reported in kg and cm, respectively, and below are some summary statistics:

Table. Summary statistics

Variable      Mean     Standard Deviation
Weight (kg)   65.8     15.1
Height (cm)   170.02   12.01
Data Transformations
The effects of changing the unit of measure

And here are some diagrams that summarize the distribution of the two variables.

[Figures: Density histograms of weight (measured in kg, 40 to 170) and height (measured in cm, 140 to 200), each with markers from −2 SD to +2 SD around the mean.]

Does the normal approximation look satisfactory?
Data Transformations
The effects of changing the unit of measure

[Figures: Box plots of weight (kg) by sex and height (cm) by sex, for females (F) and males (M).]
Data Transformations
The effects of changing the unit of measure

[Figures: Density histograms of weight measured in kg (40 to 170) and the same weights measured in pounds (80 to 360), each with markers from −2 SD to +2 SD around the mean.]

Do you think the mean matches the original one (in correct units)? How about the standard deviation?
Data Transformations
The effects of changing the unit of measure

[Figures: Density histograms of height measured in cm (140 to 200) and the same heights measured in inches (55 to 80), each with markers from −2 SD to +2 SD around the mean.]

Do you think the mean matches the original one (in correct units)? How about the standard deviation?
Data Transformations
The effects of changing the unit of measure

Here are the box plots with the transformed data:

[Figures: Box plots of weight (lb) by sex and height (in) by sex, for females (F) and males (M).]
Data Transformations
The effects of changing the unit of measure
- Observations made using the figures are, of course, based on what statisticians and econometricians often call "eye-balling" the data. These observations are certainly not formal, but they are a crucial part of effectively analyzing any dataset.
- In fact, you should make plotting, investigating and eye-balling your data a habit before you dive into complicated models and overlook important features of your dataset.
- Now that we have made our informal observations, let's look at the actual numbers.

Table. Summary statistics

Variable      Mean     SD      Mean (converted)                SD (converted)
Weight (kg)   65.8     15.1    65.8 × 2.2 ≈ 145.06 (lb)        15.1 × 2.2 ≈ 33.28 (lb)
Height (cm)   170.02   12.01   170.02 × 0.3937 ≈ 66.94 (in)    12.01 × 0.3937 ≈ 4.73 (in)
Weight (lb)   145.06   33.28   145.06 / 2.2 ≈ 65.8 (kg)        33.28 / 2.2 ≈ 15.1 (kg)
Height (in)   66.94    4.73    66.94 × 2.54 ≈ 170.02 (cm)      4.73 × 2.54 ≈ 12.01 (cm)
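A small numerical check (an illustrative sketch with simulated weights, not the lecture's dataset) that the mean and the SD both scale by the conversion factor:

```python
import numpy as np

rng = np.random.default_rng(3)
weight_kg = rng.normal(loc=65.8, scale=15.1, size=200)  # placeholder data

weight_lb = weight_kg * 2.2  # kg -> lb, the approximate factor used above
print(weight_kg.mean(), weight_lb.mean())  # the lb mean is 2.2 times the kg mean
print(weight_kg.std(), weight_lb.std())    # the lb SD is also 2.2 times larger
```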
Data Transformations
The effects of changing the unit of measure

We have seen that the mean and the standard deviation simply scale by the conversion factor when we change the unit of measure, so converting them back recovers the original values. But how does variance behave? Does the same linear scaling work?

Table. Summary statistics

Variable      Mean      Variance   Mean (converted)           Variance (scaled linearly)
Weight (kg)   65.8      228.01     65.8 × 2.2 ≈ 145.06        228.01 × 2.2 ≈ 502.68
Height (cm)   170.02    144.24     170.02 × 0.3937 ≈ 66.94    144.24 × 0.3937 ≈ 56.79
Weight (lb)   145.06    1107.56    145.06 / 2.2 ≈ 65.8        1107.56 / 2.2 ≈ 502.38
Height (in)   66.94     4.73² = 22.37   66.94 × 2.54 ≈ 170.02   22.37 × 2.54 ≈ 56.81

Scaling the variances linearly clearly does not reproduce the variances measured in the other unit (e.g. 502.68, not 1107.56, for weight).
Note that 1 inch = 2.54 cm and, similarly, 1 cm = 1/2.54 = 0.3937 in.
Then, 22.37 ≈ (0.3937)² × 144.24. The opposite is true as well: 144.24 ≈ (2.54)² × 22.37. We can apply the same to the weights in kg and lb. And in general ...
Data Transformations
Properties of Variance

... variance is scaled by the square of the constant by which all the values are scaled. While we are at it, here are some basic properties of variance:

Basic properties of variance
- Variance is non-negative: Var(X) ≥ 0
- The variance of a constant random variable is zero: P(X = a) = 1 ⟺ Var(X) = 0
- Var(aX) = a² Var(X)
- However, Var(X + a) = Var(X)
- For two random variables X and Y: Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
- ... but Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)
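These properties are easy to spot-check by simulation; a sketch with arbitrary illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
a, b = 2.54, 3.0

print(np.var(a * x), a**2 * np.var(x))   # Var(aX) = a^2 Var(X)
print(np.var(x + a), np.var(x))          # Var(X + a) = Var(X)

# bias=True matches np.var's population-style normalization,
# so the identity below holds up to floating-point error
cov = np.cov(x, y, bias=True)[0, 1]
lhs = np.var(a * x + b * y)
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov
print(lhs, rhs)
```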
Data Transformations
Log Transformation

So far, we have only worked with transformations in which we multiply each value by a constant. However, more complicated transformations are quite common in statistics and econometrics. One of the most common and useful transformations uses the natural logarithm.

Definition
Data transformation refers to applying a specific operation to each point in a dataset, whereby each data point is replaced with the transformed one. That is, the xᵢ are replaced by yᵢ = f(xᵢ).

In our previous example with heights, our function f(x) was simply f(x) = 2.54x. Now, let us study a different function: the natural logarithm.
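In code, a data transformation is just an element-wise function application; a minimal sketch with placeholder values:

```python
import numpy as np

heights_in = np.array([60.0, 66.0, 72.0])  # illustrative values in inches
heights_cm = 2.54 * heights_in             # the linear transformation f(x) = 2.54x

log_heights = np.log(heights_cm)           # the log transformation f(x) = ln(x)
print(heights_cm, log_heights)
```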
Data Transformations
Log Transformation

Log transformation in action:

[Figures: UK real GDP, 1960 to 2010. Left: output-side real GDP at current PPPs (in mil. 2005US$) on a linear scale from 0 to 2,000,000. Right: the natural log of the same series, ranging from 13 to 14.5.]
Data Transformations
Log Transformation

[Figures: Left: life expectancy (in years, 50 to 90) vs. output-side real GDP at current PPPs (in mil. 2005US$, 0 to 15,000,000). Right: log life expectancy (3.8 to 4.4) vs. the natural log of real GDP (6 to 16).]

Important note
The log transformation can only be used for variables that have positive values (why?). If the variable has zeros, the transformation can be applied only after the zeros are replaced (usually by one-half of the smallest positive value in the data set).
Data Transformations
Log Transformation

[Figures: Bubble charts of life expectancy (in years, linear scale) against real GDP per capita (at constant 2005 national prices), with bubble size proportional to population (in millions) and regions EAS, ECS, LCN, MEA, NAC, SAS, SSF; selected countries labelled (e.g. USA, GBR, JPN, CHN, IND, IDN, RUS, ZAF). The first panel plots 2011 on a linear GDP scale (0 to 60,000); the remaining panels plot 1960, 1990, and 2011 on a log GDP scale (156.25 to 80,000).]
Data Transformations
Log Transformation and growth

A useful feature of the log transformation is the interpretation of its first difference as a percentage change (for small changes). This is because ln(1 + x) ≈ x for small x.

Strictly speaking, the percentage change in Y from period t − 1 to period t is defined as (Yₜ − Yₜ₋₁)/Yₜ₋₁, which is approximately equal to ln(Yₜ) − ln(Yₜ₋₁). The approximation is almost exact if the percentage change is small.

To see this, consider the percentage change in US GDP from 2010 to 2011:

Table. US Real GDP (in mil. 2005 US$)

Year   GDP        Percentage change   ln(Yₜ)      ln(Y₂₀₁₁) − ln(Y₂₀₁₀)
2010   12993576   .                   16.379966   .
2011   13227916   1.803507            16.39784    1.787436

The difference between the exact percentage change and the log approximation is 0.01803507 − 0.01787436 = 0.00016071, a discrepancy that we might be willing to live with.
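Reproducing the table's comparison in Python (using the GDP figures quoted above):

```python
import math

y_2010, y_2011 = 12993576, 13227916

pct_change = (y_2011 - y_2010) / y_2010         # exact: ≈ 0.01803507
log_diff = math.log(y_2011) - math.log(y_2010)  # approximation: ≈ 0.01787436
print(pct_change, log_diff, pct_change - log_diff)  # discrepancy ≈ 0.00016071
```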
Examining Relationships
Covariance and Correlation

Our daily lives (and not just within economics) are filled with statements about the relationship between two variables. For example, we might read about a study that found that men spend more money online than women.

The relationship between gender and online spending may not be this simple, of course: income might play a role in this observed pattern. Ideally, we would like to set up an experiment in which we control the behavior of one variable (keeping everything else the same) and observe its effect on another. This is often not feasible in economics (a lot more on this later!).

For the time being, let's focus on simple correlation.
Examining Relationships
Covariance and Correlation

Scatter plots are very useful in identifying the sign and strength of the relationship between two variables. Therefore, it's always extremely useful to plot your data and investigate what the relationship between your two variables is:

[Figure: Scatter plot of life expectancy (in years, 65 to 85) against internet users per 100 people (0 to 80).]
Examining Relationships
Covariance and Correlation

But these plots can also be misleading to the eye simply by changing the scale of the axes:

[Figures: The same scatter plot of life expectancy vs. internet users per 100 people drawn twice with different axis ranges (life expectancy 65 to 85 vs. 45 to 95; internet users 0 to 80 vs. 0 to 120), giving visually different impressions of the same relationship.]
Examining Relationships
Covariance and Correlation

Therefore, it's best to obtain a numerical measure of the relationship. Correlation is the measure statisticians and econometricians tend to use.

Definition
Correlation measures the strength and direction of a linear relationship between two variables and is usually denoted r:

r(x,y) = r(y,x) = s(x,y) / (s(x) s(y))

where s(x,y) is the sample covariance, and s(x) and s(y) are the sample standard deviations of x and y, respectively. The former (i.e. the sample covariance) is calculated as:

s(x,y) = s(y,x) = (1/(N − 1)) Σᵢ (xᵢ − x̄)(yᵢ − ȳ), summing over i = 1, ..., N.
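These two quantities are straightforward to compute; a quick Python sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(size=100)  # built to be positively related

s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance (divides by N - 1)
r_xy = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r_xy, np.corrcoef(x, y)[0, 1])  # the two computations agree
```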
Examining Relationships
Understanding covariance

To see how a scatter diagram can be read in terms of the covariance between the two variables, consider the USA:

[Figure: "Education and GDP per capita (2010)". Scatter plot of the log of real GDP per capita (at constant 2005 national prices), 6 to 12, against average years of total schooling, 0 to 15, with the sample means x̄ and ȳ splitting the plot into quadrants; USA, KWT and COD are labelled, and the distances x_USA − x̄ and y_USA − ȳ are marked.]

Because x_USA > x̄ and y_USA > ȳ, the term (x_USA − x̄)(y_USA − ȳ) is positive. Also, (x_COD − x̄)(y_COD − ȳ) > 0, but (x_KWT − x̄)(y_KWT − ȳ) < 0.

Thus, countries located in the top-right and bottom-left quadrants have a positive effect on s(x,y), whereas countries in the top-left and bottom-right quadrants have a negative effect on s(x,y).

Question: Should we use covariance or correlation as a more "robust" measure of the relationship? Why?
Examining Relationships
Understanding covariance

To answer this question, let's look more closely at how covariance behaves. A positive (negative) covariance indicates that x tends to be above its mean value whenever y is above (below) its mean value. A sample covariance of zero suggests that x and y are unrelated.

In our example, s(x,y) = 2.69. This suggests that there is a positive relationship between x and y. But what does the value of 2.69 tell us about the strength of the relationship? Nothing.

Why not? Suppose we wanted to measure schooling in decades instead of years. That is, we generate a new variable which equals schooling measured in years divided by 10. The new covariance is s(x,y) = 0.269, which is much closer to zero.

Technically speaking, covariance is not invariant to linear transformations of the variables.
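The point is easy to demonstrate with simulated data (an illustrative sketch, not the schooling dataset itself): rescaling x rescales the covariance but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(6)
years = rng.uniform(0, 15, size=100)              # schooling-like variable
log_gdp = 6 + 0.4 * years + rng.normal(size=100)  # positively related

decades = years / 10                              # change of units
print(np.cov(years, log_gdp)[0, 1])    # some covariance c
print(np.cov(decades, log_gdp)[0, 1])  # c / 10: covariance shrinks with the units
print(np.corrcoef(years, log_gdp)[0, 1],
      np.corrcoef(decades, log_gdp)[0, 1])  # the correlation is identical
```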
Examining Relationships
Covariance versus Correlation

The sample correlation coefficient addresses this problem. While s(x,y) may take any value between −∞ and +∞, the correlation coefficient is standardised such that r ∈ [−1, 1]. Recall that

r(x,y) = r(y,x) = s(x,y) / (s(x) s(y))

where s(x,y) is the covariance of x and y, and s(x) and s(y) are the sample standard deviations of x and y, respectively.

Note that because s(x) > 0 and s(y) > 0, the sign of the sample covariance is the same as the sign of the correlation coefficient.

Correlation coefficient
- r(x,y) > 0 indicates positive correlation.
- r(x,y) < 0 indicates negative correlation.
- r(x,y) = 0 indicates that x and y are unrelated.
- r(x,y) = ±1 indicates perfect positive (negative) correlation. That is, there exists an exact linear relationship between x and y of the form y = a + bx.
Examining Relationships
Correlation

In our example, r(x,y) = 0.7763, which indicates positive correlation (because r(x,y) > 0) and that the relationship is reasonably strong (because r(x,y) is not too far away from 1).

To get a better feeling for what is "strong" and "weak", we generate 100 observations of x and y with varying degrees of correlation and plot them on a scatter diagram.

[Figures: Three scatter plots of 100 simulated (x, y) pairs (both axes from −4 to 4) with r(x,y) = .9, r(x,y) = −.9, and r(x,y) = .7.]
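One way to generate such data (a sketch of the general technique; the slides do not say how their draws were produced) is to sample from a bivariate normal distribution with the desired correlation:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, r in zip(axes, [0.9, -0.9, 0.7]):
    cov = [[1, r], [r, 1]]  # unit variances, so the covariance equals r
    x, y = rng.multivariate_normal([0, 0], cov, size=100).T
    ax.scatter(x, y, s=10)
    ax.set_title(f"r(x,y) = {r}")
plt.show()
```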
Examining Relationships
Correlation

[Figures: Three more scatter plots of 100 simulated (x, y) pairs (axes from −4 to 4) with r(x,y) = .3, r(x,y) = 0, and r(x,y) = 0, where the right-most panel traces a parabola.]

What's unusual about the right-most diagram here?

In the right-most diagram, the correlation coefficient indicates that x and y are unrelated, but the graph implies otherwise. In fact, there is a strong quadratic relationship between x and y in this case.
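This caveat is worth verifying yourself: a perfectly deterministic quadratic relationship can still produce a correlation of (almost) zero, because r only measures linear association. A minimal sketch:

```python
import numpy as np

x = np.linspace(-4, 4, 100)
y = x ** 2                      # exact quadratic dependence, no noise

print(np.corrcoef(x, y)[0, 1])  # ≈ 0: no *linear* relationship
```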
Examining Relationships
Summary
- Correlation, r, measures the strength and direction of a linear relationship between two variables.
- The sign of r indicates the direction of the relationship: r > 0 for a positive association and r < 0 for a negative one.
- r always lies within [−1, 1] and indicates the strength of a relationship by how close it is to 1 or −1.
Examining Relationships
Correlation vs Causation

You may have already encountered the statement that correlation does not imply causation. This is an important concept to grasp, because even a strong correlation between two variables is not enough to draw conclusions about causation. For instance, consider the following examples:

1. Do televisions increase life expectancy?
There is a high positive correlation between the number of television sets per person in a country and life expectancy in that country. That is, nations with more TV sets per person have higher life expectancies. Does this imply that we could extend people's lives in a country just by shipping TVs to them? No, of course not. The correlation between these two variables stems from the nation's income: richer nations have more TVs per person than poorer ones. These nations also have access to better nutrition and health care.

2. Are big hospitals bad for you?
A study has found positive correlation between the size of a hospital (measured by its number of beds) and the median number of days that patients remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a small hospital?

3. Do firefighters make fires worse?
A magazine has observed that "there's a strong positive correlation between the number of firefighters at a fire and the damage the fire does. So sending lots of firefighters just causes more damage." Is this reasoning flawed?
Examining Relationships
Reverse Causality
In addition to correlation feeding through a third (sometimes unobserved)
variable, in economics, we often run into reverse causality problems.
Earlier, we showed that real GDP per capita and education (measured by
average years of schooling) are positively correlated. This could be
because:
1. Rich countries can afford more (and better) education. That is, an
increase in GDP per capita causes an increase in schooling.
2. More (and better) education promotes innovation and productivity.
That is, an increase in schooling causes an increase in GDP per
capita.
The relationship between GDP per capita and education suffers from
reverse causality.
To reiterate, although we can make the statement that x and y are
correlated, we do not know whether y is caused by x or vice versa.
This is one of the central problems in empirical research in economics. In
the course of the MSc, you will learn methods that allow you to identify
the causal mechanisms in the relationship between y and x.