Lecture 2 Describing Data II

Summarizing and Describing Data
Frequency distribution and the shape of the distribution
Measures of variability

1. Frequency distribution and the shape of the distribution

In the previous lecture, we saw that the mean of household savings gives an inflated picture of the savings of a "normal household". This was because the histogram was not symmetric. It is therefore important to look at how the observations are distributed.

[Figure: Histogram of Japanese household savings. Horizontal axis: savings in thousand yen, in bins of 2,000 from "below 2,000" to "above 40,000". Vertical axis: percentage of households. Median = 10,520,000; sample average = 17,280,000.]

1-1 Frequency Distribution

The frequency table that we used in the previous lecture is also called the frequency distribution. A frequency distribution describes how the observations are distributed. A plot of the frequency table is called a histogram. A histogram usually shows the number of observations in each range. Sometimes, however, it shows the percentage of observations in each range.

1-2 Shape of the Distribution

The shape of the distribution refers to the shape of the histogram.

1-3 Symmetric Distribution

The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the mean. In other words, the distribution is symmetric if the histogram is symmetric.

[Figure: Symmetric distribution. Histogram with frequency on the vertical axis.]

Note: For a symmetric distribution, the mean and median are equal.
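The note on symmetric distributions can be checked with a small sketch. The data below are made up for illustration (they are not from the lecture), and the text "histogram" is only a rough stand-in for a plotted one.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical symmetric data: the values mirror each other around 5.
data = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]

# Frequency distribution: how many observations take each value.
freq = Counter(data)

# Crude text histogram, one '#' per observation.
for value in sorted(freq):
    print(f"{value}: {'#' * freq[value]}")

# For symmetric data the mean and the median coincide.
print("mean   =", mean(data))
print("median =", median(data))
```

If one large value is appended to the list, the mean moves away from the median while the median barely changes.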
Symmetric Distribution: An example

The age distribution of the clients (from the previous lecture note) is nearly symmetric.

1-4 Skewed Distribution

A distribution is skewed if the observations are not symmetrically distributed above and below the mean.

A positively skewed (or skewed to the right) distribution has a tail that extends to the right, in the direction of positive values.

A negatively skewed (or skewed to the left) distribution has a tail that extends to the left, in the direction of negative values.

Positively skewed distribution

[Figure: Positively skewed distribution. Histogram with frequency on the vertical axis.]

Positively skewed distribution: An example

The household savings histogram (from the previous lecture) is an example of a positively skewed distribution.

[Figure: Histogram of Japanese household savings. Horizontal axis: savings in thousand yen. Vertical axis: percentage of households. Median = 10,520,000; sample average = 17,280,000.]

Positively skewed distribution: A note

For a positively skewed distribution, the mean is greater than the median.

Negatively skewed distribution

[Figure: Negatively skewed distribution. Histogram with frequency on the vertical axis.]

Note: For a negatively skewed distribution, the mean is less than the median.

2. Measures of Variability

Variance
Standard deviation

Example

The data "Sales at two different stores" contain daily sales for two different stores, collected over 60 days. Store A's average daily sales is 231,800 yen. Store B's average daily sales is 230,500 yen. Can we say that they are similar stores? Look at the following graphs.
Daily sales of the two stores

[Figure: Two plots of daily sales against day. Store A: Daily Sales (average = 231,800 yen). Store B: Daily Sales (average = 230,500 yen). Horizontal axis: day. Vertical axis: daily sales in thousand yen.]

The difference between the two stores is that Store A's sales vary much more than Store B's sales. We need a measure of variability in data.

2-1 How to measure the variability (1)

Take Store A's data as an example. For each observation, you can compute the difference from the average, and the variability of each observation can be seen from this difference. But how do we measure the overall variability of the data?

[Figure: Store A: Daily Sales, with the average (231,800 yen) marked. Horizontal axis: day. Vertical axis: daily sales in thousand yen.]

How to measure the variability (2): Overall variability

How about taking the average of all the differences? This is not a good idea: the differences can be positive or negative, and they always sum to zero. Therefore, we take the square of each difference. This is the first step in computing the "variance", a measure of overall variability.

2-2 Variance: A measure of variability

Variance is computed in the following way.

1. Subtract the mean from each observation (that is, compute the difference between each observation and the mean; note that the difference can be negative).
2. Square each difference.
3. Sum all the squared differences.
4. Divide the sum of squared differences by n-1 (the number of observations minus 1).

We will learn why we divide the sum of squares by n-1 after we learn the concept of expectation.

Computation of the variance: Exercise

Open the data "Computation of Variance" and compute the variance of Store A's daily sales. Then compute the variance of Store B's daily sales.

Computation of the variance: Exercise (results)

Store A: Average daily sales = 231.8 thousand yen, Variance = 4979.9
Store B: Average daily sales = 230.5 thousand yen, Variance = 335.9

Notice that the variance for Store A is higher than that for Store B. This is because the variation in daily sales is higher for Store A.

Variance: note

In the results above, we did not attach a unit of measurement to the variance. (For example, we do not say that the variance for Store A is 4979.9 thousand yen.) This is because we square the data when we compute the variance. The unit of measurement for the variance is therefore "square of thousand yen", which is hard to interpret. For this reason we use the standard deviation, another measure of variation.

2-3 A measure of variability: Standard deviation

Standard deviation is the square root of the variance.

$$\text{Standard deviation} = \sqrt{\text{Variance}}$$

Exercise: Compute the standard deviation of the daily sales for Store A and Store B.

Standard Deviation: Store sales data example

Standard deviation of Store A's daily sales = 70.57 thousand yen.
Standard deviation of Store B's daily sales = 18.33 thousand yen.

Roughly speaking, this means that Store A's daily sales typically differ from their average by about 70.6 thousand yen, and Store B's daily sales by about 18.3 thousand yen.

Standard deviation and variance as measures of risk (or uncertainty)

Often standard deviation and variance are used as measures of uncertainty or risk.
If you would like to work as a store manager, then Store B may be the better store to work for: although its average sales are almost the same as Store A's, the uncertainty is lower (lower standard deviation).

In the store sales data, the average sales for both stores are similar. On many other occasions, however, a higher return (higher average sales) comes with a higher risk (higher standard deviation). One makes a decision by choosing a good combination of return and risk. For example, if you invest in a stock, you would choose a stock with a combination of return and risk that suits your preference.

Standard deviation and variance are therefore important numerical measures for summarizing data when making decisions.

2-4. Understanding the mathematical notation of the variance

Most of the time, we only have sample data (not population data). Variance computed from a sample is called the sample variance. We denote the sample variance by $s^2$.

When we have population data (which does not happen often), we can compute the population variance. We denote the population variance by $\sigma^2$.

Understanding the mathematical notation of sample variance

Observation id    Variable X
1                 $x_1$
2                 $x_2$
3                 $x_3$
:                 :
n                 $x_n$

The typical data we use come in this format. Using this format, we would like to represent the variance in mathematical form.

Obs id     Variable X    Each data - the mean    (Each data - the mean)^2
1          $x_1$         $x_1 - \bar{X}$         $(x_1 - \bar{X})^2$
2          $x_2$         $x_2 - \bar{X}$         $(x_2 - \bar{X})^2$
3          $x_3$         $x_3 - \bar{X}$         $(x_3 - \bar{X})^2$
:          :             :                       :
n          $x_n$         $x_n - \bar{X}$         $(x_n - \bar{X})^2$
Average    $\bar{X}$

The first steps of computing the variance are written in the table.
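The columns of the table can be reproduced with a short sketch. The five sales figures below are hypothetical (the lecture's store data are not reproduced here), and the variable names are chosen only for illustration.

```python
# Hypothetical daily sales in thousand yen; NOT the lecture's store data.
x = [210.0, 250.0, 190.0, 270.0, 230.0]

n = len(x)
x_bar = sum(x) / n  # the sample mean (the "Average" row of the table)

# Column 3: each observation minus the mean (can be negative).
deviations = [xi - x_bar for xi in x]

# Column 4: the squared deviations (never negative).
squared = [d ** 2 for d in deviations]

print("obs  x      x - mean  (x - mean)^2")
for i, (xi, d, sq) in enumerate(zip(x, deviations, squared), start=1):
    print(f"{i:<4} {xi:<6} {d:<9} {sq}")

# The raw deviations cancel out, which is why they are squared.
print("sum of deviations         =", sum(deviations))
print("sum of squared deviations =", sum(squared))
```

The sum of the raw deviations is zero for these data, while the sum of squared deviations is positive, which is what makes the last column usable as a measure of overall variability.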
The variance can be computed by summing the last column and dividing the sum by (n-1).

Understanding the mathematical notation for sample variance

Mathematically, the sample variance, denoted by $s^2$, can be written as

$$s^2 = \frac{(x_1 - \bar{X})^2 + (x_2 - \bar{X})^2 + (x_3 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}$$

Mathematical notation for population variance

Though not often, we may have population data. Then we can compute the population variance. We use the notation $\sigma^2$ to denote the population variance, $\mu$ to denote the population mean, and upper case $N$ to denote the number of observations. The mathematical notation for the population variance is

$$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Unlike the case of the sample variance, we do not divide the sum of squares by N-1. We simply divide it by N.

2-5. Mathematical notation for the sample standard deviation

The sample standard deviation, $s$, is written as

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$$

Mathematical notation for population standard deviation

The population standard deviation, $\sigma$, is written as

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

2-6. Short-cut formula for sample variance

The short-cut formula for the sample variance is:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n-1}$$

Exercise

Compute the variance of the sales of Store A by applying the short-cut formula for the sample variance, and show that the result coincides with our previous calculation.

Other Measures of Variability

1. The Range

The range of a set of data is the difference between the largest and smallest observations.

Other Measures of Central Tendency

2. Mode

The mode, if one exists, is the most frequently occurring observation in the sample or population.

This lecture note covers:
Textbook P23~P28: Frequency distribution
Textbook 3.1, 3.2: Measures of central tendency and variability
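As a supplementary illustration of the measures in this lecture, the sketch below computes the sample variance in two ways (the definition and the short-cut formula of 2-6), then the standard deviation, the range, and the mode. The six sales figures are hypothetical and are not the lecture's Store A data; Python's standard `statistics` module is used only as an independent check.

```python
import math
from statistics import mode, variance

# Hypothetical daily sales in thousand yen; NOT the lecture's store data.
x = [210.0, 250.0, 190.0, 270.0, 230.0, 230.0]

n = len(x)
x_bar = sum(x) / n

# Definition: sum of squared differences from the mean, divided by n-1.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Short-cut formula: (sum of squares - n * mean^2) divided by n-1.
s2_shortcut = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

# Standard deviation: the square root of the variance.
s = math.sqrt(s2_definition)

# Range: largest observation minus smallest observation.
data_range = max(x) - min(x)

# Mode: the most frequently occurring observation.
most_frequent = mode(x)

print("variance (definition) =", s2_definition)
print("variance (short-cut)  =", s2_shortcut)
print("variance (statistics) =", variance(x))  # also divides by n-1
print("standard deviation    =", round(s, 2))
print("range                 =", data_range)
print("mode                  =", most_frequent)
```

Both formulas give the same variance for these data. The short-cut form avoids computing each difference separately, which is convenient for hand calculation, though with very large values it can lose precision in floating-point arithmetic.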