Basic Statistics for SGPE Students
Part I: Descriptive Statistics

Achim Ahrens ([email protected])
Anna Babloyan ([email protected])
Erkal Ersoy ([email protected])

Heriot-Watt University, Edinburgh
September 2015

Outline
1. Descriptive statistics
   - Sample statistics (mean, variance, percentiles)
   - Graphs (box plot, histogram)
   - Data transformations (log transformation, unit of measure)
   - Correlation vs. causation
2. Probability theory
   - Conditional probabilities and independence
   - Bayes' theorem
3. Probability distributions
   - Discrete and continuous probability functions
   - Probability density function & cumulative distribution function
   - Binomial, Poisson and Normal distributions
   - E[X] and V[X]
4. Statistical inference
   - Population vs. sample
   - Law of large numbers
   - Central limit theorem
   - Confidence intervals
   - Hypothesis testing and p-values

Descriptive statistics
- In recent years, more and better-quality data have been recorded than at any other time in history.
- The increasing size of the data sets readily available to us has enabled us to adopt new and more robust statistical tools.
- Rising data availability has (unfortunately) led empirical researchers to sometimes overlook preliminary steps, such as summarizing and visually examining their data sets.
- Ignoring these preliminary steps can lead to important issues and invalidate seemingly significant results.

As we will see in this and the following lectures, there are several ways in which we can numerically summarize a data set. Before we discuss those approaches, let's take a quick look at what is available to us to visualize a data set graphically.

Histograms
- Histograms are extremely useful in getting a good graphical representation of the distribution of data. These figures consist of adjacent rectangles over discrete intervals, whose areas are proportional to the frequency of observations in the interval.
- Histograms are often normalized to show the proportion (or density) of observations that fall into non-overlapping categories. In such cases, the total area under the bins equals 1.

Remark
The height of each bin in a normalized histogram represents the density, i.e. the proportion of observations that fall into that category. These can more easily be interpreted as percentages.

[Figure: normalized histogram of life expectancy (in years) across countries in 1960.]
- Approximately, what is the average life expectancy in 1960?
- Roughly what percentage of countries had life expectancy above 65?
- What proportion of countries had a life expectancy of less than 55 years?

[Figures: the same normalized histogram of life expectancy for 1960, 1990 and 2011.]

The Mean and Standard Deviation
- A histogram can help summarize large amounts of data, but we often like to see an even shorter (and sometimes easier to interpret) summary. This is usually provided by the mean and the standard deviation.
- The mean (and median) are frequently used to find the center, whereas the standard deviation measures the spread.

Definition
The (arithmetic) mean of a list of numbers is their sum divided by how many there are. For example, the mean of 9, 1, 2, 2, 0 is (9 + 1 + 2 + 2 + 0)/5 = 2.8. More generally,

  mean = x̄ = (1/n) Σ_{i=1}^{n} x_i

The Mean and Standard Deviation
- The standard deviation (SD) tells us how far numbers on a list deviate from their average.
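The arithmetic in the definition above is easy to check in a couple of lines of Python (a minimal sketch; the helper name `mean` is ours):

```python
def mean(xs):
    """Arithmetic mean: the sum of the entries divided by how many there are."""
    return sum(xs) / len(xs)

# The example from the definition: the mean of 9, 1, 2, 2, 0
print(mean([9, 1, 2, 2, 0]))  # -> 2.8
```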
- Usually, most numbers are within one SD of the mean. More specifically, for normally distributed variables, about 68% of entries are within one SD of the mean and about 95% of entries are within two SDs.

[Figure: a normal curve with 68% of the area between (mean − one SD) and (mean + one SD), and 95% between (mean − two SDs) and (mean + two SDs).]

Computing the Standard Deviation
Definition
  Standard deviation = sqrt( mean of (deviations from the mean)² )
where deviation from mean = entry − mean. In formal notation,
  σ = sqrt( (1/N) Σ_{i=1}^{N} (x_i − µ)² ),  where µ = (1/N)(x_1 + ... + x_N).

Example: Find the SD of 20, 10, 15, 15.
Answer: mean = x̄ = (20 + 10 + 15 + 15)/4 = 15. The deviations are then 5, −5, 0, 0, respectively. So,
  SD = sqrt( (5² + (−5)² + 0² + 0²)/4 ) = sqrt(50/4) = sqrt(12.5) ≈ 3.5.

Remark
The SD comes out in the same units as the data. For example, if the data are a set of individuals' heights in inches, the SD is in inches too.

The Root-Mean-Square
Consider the following list of numbers: 0, 5, −8, 7, −3.
Question: How big are these numbers? What is their mean?
The mean is 0.2, but this does not tell us much about the size of the numbers; it only implies that the positive numbers slightly outweigh the negative ones. To get a better sense of their size, we could use the mean of their absolute values. Statisticians tend to use another measure, though: the root-mean-square.

Definition
  Root-mean-square (rms) = sqrt( average of (entries)² )

The Root-Mean-Square and Standard Deviation
There is an alternative way of calculating the SD using the root-mean-square:

Remark
  SD = sqrt( mean of (entries)² − (mean of entries)² )

Recall the four numbers we used earlier to calculate the SD: 20, 10, 15, 15.
  mean of (entries)² = (20² + 10² + 15² + 15²)/4 = 950/4 = 237.5
  (mean of entries)² = ((20 + 10 + 15 + 15)/4)² = 15² = 225
Therefore, SD = sqrt(237.5 − 225) = sqrt(12.5) ≈ 3.5, which agrees with what we found earlier.
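Both routes to the SD, the definition and the rms shortcut, can be sketched in Python (the helper names are ours) and agree on the example above:

```python
import math

def sd(xs):
    """SD as the root of the mean squared deviation from the mean."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sd_shortcut(xs):
    """SD via the identity: sqrt(mean of squares minus square of the mean)."""
    mean_of_squares = sum(x ** 2 for x in xs) / len(xs)
    square_of_mean = (sum(xs) / len(xs)) ** 2
    return math.sqrt(mean_of_squares - square_of_mean)

data = [20, 10, 15, 15]
print(round(sd(data), 2), round(sd_shortcut(data), 2))  # -> 3.54 3.54
```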
Variance
- In probability theory and statistics, variance gets mentioned nearly as often as the mean and standard deviation. It is very closely related to the SD and is a measure of how far a set of numbers lie from their mean.
- Variance is the second moment of a distribution (the mean being the first moment), and therefore tells us about the properties of the distribution (more on these later).

Definition
  Variance = (Std. Dev.)² = σ²

Normal Approximation for Data and Percentiles
[Figure: histogram of S&P 500 trading volume (in thousands), January 2001 - December 2001, with mean and SD markers. Source: Yahoo! Finance and Commodity Systems, Inc.]
Is the normal approximation satisfactory here?

[Figure: the 2011 life-expectancy histogram with a fitted normal curve and mean and SD markers.]
How about here?

Remark
The mean and SD can be used to effectively summarize data that follow the normal curve, but these summary statistics can be much less satisfactory for data that do not follow the normal curve. In such cases, statisticians often opt for using percentiles to summarize distributions.

Table. Selected percentiles for life expectancy in 2011
  Percentile:  1    10    25   50    75    95    99
  Value:       48   52.6  63   73.4  76.9  81.8  82.7

Calculating percentiles
1. Order all the values in your data set in ascending order (i.e. smallest to largest).
2. Select a percentile, P, that you would like to calculate and multiply it by the total number of entries in your data set, n. The value you obtain here is called the index.
3. If the index is not a whole number, round it up to the next integer.
4. Count the entries in your list of numbers starting from the smallest one until you get to the number indicated by your index.
5. This entry is the Pth percentile in your data set.

Calculating percentiles
Example
Consider the following list of 5 numbers: 10, 15, 20, 25, 30. What is the entry that corresponds to the 25th percentile? What is the median?
To obtain the 25th percentile, all we need to do is compute 0.25 × 5 = 1.25. Since this is not a whole number, we round it up to 2, so the 25th percentile in this case is the second entry, 15.
We were also asked to obtain the median. To do this, calculate 0.5 × 5 = 2.5. Rounding this up gives 3. So, the median in this case is the third entry, 20.

Percentiles
The 1st percentile of the distribution is approximately 48, meaning that life expectancy in 1% of countries in 2011 was 48 years or less, and 99% of countries had life expectancy higher than that. Similarly, the fact that the 25th percentile is 63 implies that 25% of countries had life expectancy of 63 or less, whereas 75% had a longer expected lifespan.

Definition
The interquartile range is defined as 75th percentile − 25th percentile and is sometimes used as a measure of spread, particularly when the SD would pay too much (or too little) attention to a small percentage of cases in the tails of the distribution. From the table above, the interquartile range equals 76.9 − 63 = 13.9 (and the SD was 10.14).

Box plots
The structure of a box plot, from top to bottom:
- Upper adjacent line: the largest value within the upper whisker's reach above the 75th percentile
- Box: from the 25th percentile (1st quartile, lower hinge), through the 50th percentile (median), to the 75th percentile (3rd quartile, upper hinge)
- Lower adjacent line: the smallest value within the lower whisker's reach below the 25th percentile
- Entries beyond the adjacent lines are plotted individually

Box plots
Are there any clear patterns emerging from summarizing the data this way?
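The percentile recipe above, and the quartiles that a box plot is built from, translate directly into code. A minimal sketch (the function names are ours; the 1.5 × IQR whisker reach is the usual convention and an assumption here, since the slides do not state the multiplier):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile following the five steps above."""
    ordered = sorted(values)   # step 1: ascending order
    index = p * len(ordered)   # step 2: index = P * n
    rank = math.ceil(index)    # step 3: round up if not a whole number
    return ordered[rank - 1]   # steps 4-5: count in from the smallest entry

data = [10, 15, 20, 25, 30]
q1 = percentile(data, 0.25)   # 0.25 * 5 = 1.25 rounds up to 2 -> 15
q3 = percentile(data, 0.75)   # 0.75 * 5 = 3.75 rounds up to 4 -> 25
iqr = q3 - q1                 # interquartile range
print(q1, q3, iqr)            # -> 15 25 10

# Whisker reach for a box plot (conventional 1.5 * IQR rule, assumed here)
upper_reach = q3 + 1.5 * iqr
lower_reach = q1 - 1.5 * iqr
```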
[Figure: box plots of life expectancy (in years) by region in 2011.]
Legend:
- EAS: East Asia & Pacific
- ECS: Europe & Central Asia
- LCN: Latin America & Caribbean
- MEA: Middle East & North Africa
- NAC: North America
- SAS: South Asia
- SSF: Sub-Saharan Africa

Box plots
We might be able to spot some patterns that developed over time if we look at different years:
[Figures: box plots of life expectancy (in years) by region for 1960, 1990 and 2011.]

Data Transformations
The effects of changing the unit of measure
- Now that we know how to summarize a dataset, let us turn to the effects that changing the unit of measure of a variable has on the mean and standard deviation.
- Such changes in the unit of measure could be for practical reasons or based on theory, but regardless of the reason, a statistician should know what to expect.
- To study this, let's consider a dataset on 200 individuals' weights and heights. Each entry is originally reported in kg and cm, respectively, and below are some summary statistics:

Table. Summary statistics
  Variable      Mean     Standard Deviation
  Weight (kg)   65.8     15.1
  Height (cm)   170.02   12.01

And here are some diagrams that summarize the distribution of the two variables.

[Figures: normalized histograms of weight (measured in kg) and height (measured in cm), with mean and SD markers.]
Does the normal approximation look satisfactory?

[Figures: box plots of weight (kg) and height (cm) by sex.]

[Figures: normalized histograms of weight measured in kg and in pounds, with mean and SD markers.]
Do you think the mean matches the original one (in correct units)? How about the standard deviation?

[Figures: normalized histograms of height measured in cm and in inches, with mean and SD markers.]
Do you think the mean matches the original one (in correct units)? How about the standard deviation?
Here are the box plots with the transformed data:
[Figures: box plots of weight (lb) and height (in) by sex.]

The effects of changing the unit of measure
- Observations made using the figures are, of course, based on what statisticians and econometricians often call "eye-balling" the data. These observations are certainly not formal, but they are a crucial part of effectively analyzing any dataset.
- In fact, you should make plotting, investigating and eye-balling your data a habit, so that you do not dive into complicated models and overlook important features of your dataset.
- Now that we have made our informal observations, let's look at the actual numbers.

Table. Summary statistics (using 1 kg ≈ 2.2 lb, 1 cm ≈ 0.4 in, 1 in ≈ 2.5 cm)
  Variable      Mean     SD      Mean (converted)         SD (converted)
  Weight (kg)   65.8     15.1    65.8 × 2.2 ≈ 145.06      15.1 × 2.2 ≈ 33.28
  Height (cm)   170.02   12.01   170.02 × 0.4 ≈ 66.94     12.01 × 0.4 ≈ 4.73
  Weight (lb)   145.06   33.28   145.06 / 2.2 ≈ 65.8      33.28 / 2.2 ≈ 15.1
  Height (in)   66.94    4.73    66.94 × 2.5 ≈ 170.02     4.73 × 2.5 ≈ 12.01

The effects of changing the unit of measure
We have seen that the mean and the standard deviation are simply rescaled by the conversion factor when we change the unit of measure, but how does variance behave? Note that scaling the variance by the conversion factor does not work:

Table. Summary statistics
  Variable      Mean     Variance   Variance × conversion factor
  Weight (kg)   65.8     228.01     228.01 × 2.2 ≈ 502.68 (but the variance in lb is 1107.56)
  Height (cm)   170.02   144.24     144.24 × 0.4 ≈ 56.79 (but the variance in inches is 22.37)
  Weight (lb)   145.06   1107.56    1107.56 / 2.2 ≈ 502.38 (but the variance in kg is 228.01)
  Height (in)   66.94    22.37      22.37 × 2.54 ≈ 56.81 (but the variance in cm is 144.24)

1. Note that 1 inch = 2.54 cm and, similarly, 1 cm = 1/2.54 = 0.3937 in.
2. Then, 22.37 ≈ (0.3937)² × 144.24. The opposite is true as well: 144.24 ≈ (2.54)² × 22.37. And we can apply the same to the weights in kg and lb: 1107.56 ≈ (2.2)² × 228.01.

And in general ...

Properties of Variance
...variance is scaled by the square of the constant by which all the values are scaled.
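This squared-scaling rule, and the linear rule for the mean and SD, can be checked numerically. A quick sketch with a made-up sample standing in for the 200 weights (the conversion factor 2.2 lb/kg is the slides' approximation):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

kg = [52.0, 61.5, 65.8, 70.2, 79.5]   # made-up weights in kg
lb = [2.2 * x for x in kg]            # the same weights in pounds

# Mean and SD scale linearly with the conversion factor ...
print(mean(lb) / mean(kg))                                # ≈ 2.2
print(math.sqrt(variance(lb)) / math.sqrt(variance(kg)))  # ≈ 2.2
# ... but variance scales with its square: 2.2**2 = 4.84
print(variance(lb) / variance(kg))                        # ≈ 4.84
```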
While we are at it, here are some basic properties of variance:

Basic properties of variance
- Variance is non-negative: Var(X) ≥ 0
- The variance of a constant random variable is zero: P(X = a) = 1 ↔ Var(X) = 0
- Var(aX) = a² Var(X)
- However, Var(X + a) = Var(X)
- For two random variables X and Y, Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
- ...but Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)

Log Transformation
So far, we have only worked with transformations in which we multiply each value by a constant. However, more complicated transformations are quite common in statistics and econometrics. One of the most common and useful transformations uses the natural logarithm.

Definition
Data transformation refers to applying a specific operation to each point in a dataset, in which each data point is replaced with the transformed one. That is, the x_i are replaced by y_i = f(x_i).

In our previous example with heights, our function f(x) was simply f(x) = 2.54x. Now, let us study a different function: the natural logarithm.

Log transformation in action:
[Figures: UK output-side real GDP at current PPPs (in mil. 2005 US$), 1960-2010, plotted in levels and as the natural log.]
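One reason the log series is so convenient can be previewed numerically: the first difference of the log of a series approximates its growth rate. A sketch with a made-up GDP series growing at exactly 2% a year (the series is ours, purely for illustration):

```python
import math

# Made-up GDP series growing at exactly 2% per year
gdp = [1000.0]
for _ in range(5):
    gdp.append(gdp[-1] * 1.02)

for y0, y1 in zip(gdp, gdp[1:]):
    pct_change = (y1 - y0) / y0             # exact growth rate
    log_diff = math.log(y1) - math.log(y0)  # first difference of logs
    print(round(pct_change, 6), round(log_diff, 6))  # 0.02 vs ~0.019803
```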
[Figure: scatter plots of life expectancy (in years) against output-side real GDP at current PPPs (in mil. 2005 US$), in levels and with GDP in natural logs.]

Important note
The log transformation can only be used for variables that have positive values (why?). If the variable has zeros, the transformation can be applied only after these figures are replaced (usually by one-half of the smallest positive value in the data set).

[Figures: bubble plots of life expectancy (in years) against real GDP per capita (at constant 2005 national prices) for 1960, 1990 and 2011, with the GDP axis on a log scale (and, for 2011, also on a linear scale for comparison). Bubble size indicates population (in millions), colour indicates region, and selected countries (JPN, GBR, USA, CHN, IND, IDN, RUS, ZAF) are labelled.]

Log Transformation and growth
A useful feature of the log transformation is the interpretation of its first difference as a percentage change (for small
changes). This is because ln(1 + x) ≈ x for a small x.

Strictly speaking, a percentage change in Y from period t − 1 to period t is defined as (Y_t − Y_{t−1}) / Y_{t−1}, which is approximately equal to ln(Y_t) − ln(Y_{t−1}). And the approximation is almost exact if the percentage change is small. To see this, consider the percentage change in US GDP from 2010 to 2011:

Table. US Real GDP (in mil. 2005 US$)
  Year   GDP        Percentage change (%)   ln(Y_t)     100 × [ln(Y_2011) − ln(Y_2010)]
  2010   12993576   1.803507                16.379966   1.787436
  2011   13227916   .                       16.39784    .

And the difference between the two measures is 0.01803507 − 0.01787436 = 0.00016071, a discrepancy that we might be willing to live with.

Examining Relationships
Covariance and Correlation
Our daily lives (and not just within economics) are filled with statements about the relationship between two variables. For example, we might read about a study that found that men spend more money online than women. The relationship between gender and online spending may not be this simple, of course: income might play a role in this observed pattern. Ideally, we would like to set up an experiment in which we control the behavior of one variable (keeping everything else the same) and observe its effect on another. This is often not feasible in economics (a lot more on this later!). For the time being, let's focus on simple correlation.

Covariance and Correlation
Scatter plots are very useful in identifying the sign and strength of the relationship between two variables. Therefore, it is always extremely useful to plot your data and investigate what the relationship between your two variables is:
[Figure: scatter plot of life expectancy (in years) against internet users per 100 people.]

Covariance and Correlation
But these plots can also be misleading to the eye simply by changing the scale of the axes:
[Figures: the same scatter plot of life expectancy against internet users per 100 people, drawn twice with different axis ranges so that the relationship looks stronger in one panel than in the other.]

Therefore, it is best to obtain a numerical measure of the relationship. And correlation is the measure statisticians and econometricians tend to use.

Definition
Correlation measures the strength and direction of a linear relationship between two variables and is usually denoted as r:
  r_{x,y} = r_{y,x} = s_{x,y} / (s_x s_y)
where s_{x,y} is the sample covariance, and s_x and s_y are the sample standard deviations of x and y, respectively. The former (i.e. the sample covariance) is calculated as:
  s_{x,y} = s_{y,x} = (1 / (N − 1)) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ).

Understanding covariance
To see how a scatter diagram can be read in terms of the covariance between the two variables, consider the USA:
[Figure: scatter plot of the log of real GDP per capita (at constant 2005 national prices) against average years of total schooling, 2010, with the deviations x_USA − x̄ and y_USA − ȳ marked and the USA, KWT and COD labelled.]
Because x_USA > x̄ and y_USA > ȳ, the term (x_USA − x̄)(y_USA − ȳ) is positive. Also, (x_COD − x̄)(y_COD − ȳ) > 0, but (x_KWT − x̄)(y_KWT − ȳ) < 0. Thus, countries located in the top-right and bottom-left quadrants have a positive effect on s_{x,y}, whereas countries in the top-left and bottom-right quadrants have a negative effect on s_{x,y}.

Question: Should we use covariance or correlation as a more "robust" measure of the relationship? Why?
Understanding covariance
To answer this question, let's look more closely at how covariance behaves:
- A positive (negative) covariance indicates that x tends to be above its mean whenever y is above (below) its mean.
- A sample covariance of zero suggests that x and y are unrelated.
- In our example, s_{x,y} = 2.69. This suggests that there is a positive relationship between x and y. But what does the value of 2.69 tell us about the strength of the relationship? Nothing. Why not? Suppose we wanted to measure schooling in decades instead of years. That is, we generate a new variable which equals schooling measured in years divided by 10. The new covariance is s_{x,y} = 0.269, which is much closer to zero. Technically speaking, covariance is not invariant to linear transformations of the variables.

Covariance versus Correlation
The sample correlation coefficient addresses this problem. While s_{x,y} may take any value between −∞ and +∞, the correlation coefficient is standardised such that r ∈ [−1, 1]. Recall that
  r_{x,y} = r_{y,x} = s_{x,y} / (s_x s_y)
where s_{x,y} is the covariance of x and y, and s_x and s_y are the sample standard deviations of x and y, respectively. Note that because s_x > 0 and s_y > 0, the sign of the sample covariance is the same as the sign of the correlation coefficient.

Correlation coefficient
- r_{x,y} > 0 indicates positive correlation.
- r_{x,y} < 0 indicates negative correlation.
- r_{x,y} = 0 indicates that x and y are unrelated.
- r_{x,y} = ±1 indicates perfect positive (negative) correlation. That is, there exists an exact linear relationship between x and y of the form y = a + bx.

Correlation
In our example, r_{x,y} = 0.7763, which indicates positive correlation (because r_{x,y} > 0) and that the relationship is reasonably strong (because r_{x,y} is not too far away from 1).
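The covariance and correlation formulas, and the invariance property just discussed, can be sketched in Python (the helper names and the data are ours, made up for illustration):

```python
import math

def sample_cov(xs, ys):
    """s_xy = (1/(N-1)) * sum of (x_i - xbar)(y_i - ybar)."""
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_sd(xs):
    """Sample SD as the root of the sample variance, s_xx."""
    return math.sqrt(sample_cov(xs, xs))

def corr(xs, ys):
    """r_xy = s_xy / (s_x * s_y)."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))

schooling = [4.0, 7.5, 9.0, 11.0, 13.5]   # made-up average years of schooling
log_gdp = [7.1, 8.0, 8.6, 9.4, 10.2]      # made-up log GDP per capita

# Rescaling schooling to decades shrinks the covariance by a factor of 10 ...
decades = [s / 10 for s in schooling]
print(sample_cov(schooling, log_gdp), sample_cov(decades, log_gdp))
# ... but leaves the correlation coefficient unchanged
print(corr(schooling, log_gdp), corr(decades, log_gdp))
```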
Correlation
To get a better feeling for what is "strong" and "weak", we generate 100 observations of x and y with varying degrees of correlation and plot them on scatter diagrams:
[Figures: scatter plots of simulated data with r(x,y) = .9, −.9, .7, .3, 0 and 0.]

What's unusual about the right-most diagram here? In the right-most diagram, the correlation coefficient indicates that x and y are unrelated, but the graph implies otherwise. In fact, there is a strong quadratic relationship between x and y in this case.

Summary
- Correlation, r, measures the strength and direction of a linear relationship between two variables.
- The sign of r indicates the direction of the relationship: r > 0 for a positive association and r < 0 for a negative one.
- r always lies within [−1, 1] and indicates the strength of a relationship by how close it is to 1 or −1.

Correlation vs Causation
You may have already encountered the statement that correlation does not imply causation. This is an important concept to grasp, because even a strong correlation between two variables is not enough to draw conclusions about causation. For instance, consider the following examples:
1. Do televisions increase life expectancy? There is a high positive correlation between the number of television sets per person in a country and life expectancy in that country. That is, nations with more TV sets per person have higher life expectancies. Does this imply that we could extend people's lives in a country just by shipping TVs to them? No, of course not. The correlation between these two variables stems from the nations' incomes: richer nations have more TVs per person than poorer ones. These nations also have access to better nutrition and health care.
2. Are big hospitals bad for you? A study has found positive correlation between the size of a hospital (measured by its number of beds) and the median number of days that patients remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a small hospital?
3. Do firefighters make fires worse? A magazine has observed that "there's a strong positive correlation between the number of firefighters at a fire and the damage the fire does. So sending lots of firefighters just causes more damage." Is this reasoning flawed?

Reverse Causality
In addition to correlation feeding through a third (sometimes unobserved) variable, in economics we often run into reverse causality problems. Earlier, we showed that real GDP per capita and education (measured by average years of schooling) are positively correlated. This could be because:
1. Rich countries can afford more (and better) education. That is, an increase in GDP per capita causes an increase in schooling.
2. More (and better) education promotes innovation and productivity. That is, an increase in schooling causes an increase in GDP per capita.
The relationship between GDP per capita and education suffers from reverse causality. To reiterate, although we can make the statement that x and y are correlated, we do not know whether y is caused by x or vice versa. This is one of the central problems in empirical research in economics. In the course of the MSc, you will learn methods that allow you to identify the causal mechanisms in the relationship between y and x.