Lecture Notes 3: Data summarization

Highlights:
• Average
• Median
• Quartiles
• 5-number summary (and relation to boxplots)
• Outliers
• Range & IQR
• Variance and standard deviation
• Determining shape using mean & median

Some important characteristics of a data set
Location: Where is the data set “located” along a number line? Where is its center?
Spread: How dispersed (i.e. spread out) is the data?
Outliers: Are there any unusual values in the data set?
Shape: What is the shape of the distribution of values in the data set?

Location Statistics: Mean, Median & Quartiles
• In these notes, we will look at some common descriptive statistics that are useful for summarizing a data set.
• Recall that a statistic is any number calculated from a set of data.
• The most succinct way to describe the location of a data set is to identify its center.
• Two statistics are used to describe the center: the mean and the median.

Sample average
• The sample average (a.k.a. mean) is the sum of the data divided by the sample size.
• We denote the mean using $\bar{x}$, or “x bar”.
• The sample size is the number of observations in the sample, and is denoted “n”.
• The sum of all the observations in a sample is denoted by $\sum x_i$.
• So, our formula for the sample mean is $\bar{x} = \frac{\sum x_i}{n}$.

Sample Average Example
• Suppose we are interested in the average undulation rate (in Hz) of a paradise tree snake, which undulates after jumping from a tree in order to glide away.
• We take a sample of n = 8 snakes and somehow measure the rates at which they undulate as they propel themselves from a source.
• The eight observed rates are 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6.

Sample Average Example
So, for this sample, we can compute: $\bar{x} = \frac{\sum x_i}{n} = \frac{11.0}{8} = 1.375$ Hz.

Median
• If you put data in order from the smallest to the largest values, the number in the middle is called the median.
• The median separates the bottom 50% of the data from the top 50% of the data.
• If the sample size is odd, the median will be a value in your sample. If the sample size is even, the median will be “between” the middle two numbers in your sample.

Computing the median
1) Order the data set, smallest to largest.
2) Compute the rank of the median using Rank = (n + 1)/2. The rank tells you which ordered observation will be the median.
3) If “Rank” is an integer, go right to that observation in the sorted data set. Otherwise compute the average of the two surrounding ordered observations.
For instance, if rank = 5, then the median is the 5th ordered observation. If rank = 5.5, then the median is the average of the 5th and 6th ordered observations.

Computing the Median
• The data set below is already ordered. There are 19 observations.
49, 69, 70, 70, 73, 78, 81, 81, 96, 96, 105, 110, 116, 116, 117, 121, 137, 142, 151
• Find the rank of the median using (n+1)/2: (19 + 1)/2 = 10.
• Now go to this observation by counting from the start of the data set to the rank of the median. The 10th ordered observation is 96.
• You can verify that this is the median by making sure that there are the same number of observations above it as there are below it (here, nine on each side).

Computing the Median
• The data set below is already ordered. There are 20 observations.
49, 69, 70, 70, 73, 78, 81, 81, 96, 96, 105, 110, 116, 116, 117, 121, 137, 142, 151, 175
• Find the rank of the median using (n+1)/2: (20 + 1)/2 = 10.5.
• In this case, the rank is between two integers, so the median will be the average of the 10th and 11th ordered observations: (96 + 105)/2 = 100.5.

Location Statistics: Quartiles
• The median breaks the data set into two halves.
• Quartiles break the data set into 4 quarters.
• The lower quartile, Q1, is the “median” of all the data below the overall median.
• The upper quartile, Q3, is the “median” of all the data above the overall median.

Computing Quartiles
Here, there are 10 observations below the median:
49, 69, 70, 70, 73, 78, 81, 81, 96, 96
We can find their “median”, Q1, in the usual manner: the rank is (10 + 1)/2 = 5.5, so Q1 = (73 + 78)/2 = 75.5.
Q1 separates the lower 25% from the upper 75% of the data.
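The rank rule for the median, and its reuse on each half of the data to get the quartiles, can be sketched in Python. This is only an illustration of the procedure in these notes: the function names `median_by_rank` and `quartiles` are made up for this example, and statistical software often uses slightly different quartile conventions.

```python
from statistics import mean


def median_by_rank(values):
    """Median using the rank rule from the notes: rank = (n + 1) / 2."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = (len(ordered) + 1) / 2
    if rank.is_integer():
        # Ranks count from 1, Python indexes from 0.
        return ordered[int(rank) - 1]
    # Rank falls between two integers: average the two surrounding observations.
    below = ordered[int(rank) - 1]
    above = ordered[int(rank)]
    return (below + above) / 2


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    When n is odd, the overall median is left out of both halves,
    as described in the notes.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median_by_rank(lower_half), median_by_rank(upper_half)


data_19 = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
           105, 110, 116, 116, 117, 121, 137, 142, 151]
data_20 = data_19 + [175]
snake_rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

print(median_by_rank(data_19))      # 96 (rank 10)
print(median_by_rank(data_20))      # 100.5 (rank 10.5)
print(quartiles(data_20))           # (75.5, 119.0)
print(mean(snake_rates))            # approximately 1.375
print(median_by_rank(snake_rates))  # approximately 1.35
```

The same two functions work for the small snake data set, where the quartiles can also be found by eye once the values are sorted.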

Computing Quartiles
Likewise, there are 10 observations above the median:
105, 110, 116, 116, 117, 121, 137, 142, 151, 175
We can use the same rank we used to find Q1 (5.5), but start counting from the first observation above the overall median: Q3 = (117 + 121)/2 = 119.
Q3 separates the lower 75% from the top 25% of the data.

Computing Quartiles
• A brief aside: when the sample size is odd, it will not be the case that *exactly* 50% of the data is below the median or that *exactly* 50% is above it.
• This is because the median itself is not counted as being in either the upper or lower half of the data set.
• For reasonably large data sets, we may say things like “50% of the data is above the median” and “25% of the data is below Q1”, even though in some cases these are approximations.

Computing Quartiles
• Note that for relatively small data sets, you may be able to “eyeball” the data to find the median, Q1, and Q3, rather than using rank.
• For instance, it is not challenging to find the median and quartiles for the snake undulation rate data set of size n = 8 from before.
• Simply order the numbers 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6 from smallest to largest, and you can quickly see where the median and quartiles lie:
0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0
The median is 1.35 (between 1.3 and 1.4), Q1 is 1.2, and Q3 is 1.5 (between 1.4 and 1.6).

Location Statistics: Extremes
• We are also often interested in the extremes of a data set.
• These extreme values are referred to as the minimum and the maximum. “Extreme” in this context doesn’t necessarily mean “really big” or “really small”. It just means “the biggest” or “the smallest”.

The 5-number summary
• The 5-number summary can be used to summarize a data set.
• This group consists of the minimum, maximum, Q1, median, and Q3.
• These are all measures of location.

Boxplots and the 5-number summary
• Sometimes boxplots are called “box and whisker plots.”
• Boxplots graphically illustrate the 5 values in a 5-number summary.
[Figure: boxplot of height (female)]

Boxplots and the 5-number summary
• Boxplots can be displayed horizontally or vertically.
• The dark line inside the box is the median.
• The edges of the box are Q1 and Q3.
• The whiskers extend to either the min and max, or to the furthest non-outliers.

Boxplots and the 5-number summary
• Outliers are represented as dots on a boxplot.
• Note: 50% of the data is inside the box, 25% is below the box, and 25% is above the box.

Outliers
• Outliers are data points that are located far away from where the majority of the data lie.
• There is no universal agreement on what the standard should be for classifying an observation as an outlier. It is to some extent subjective.
• Data analysis software packages have internal standards by which they decide which values should be considered outlying.

Outliers
• It’s usually a good idea to look more closely at an outlier to see if it is real or if it is a mistake.
• The outlier might be an improperly entered data value. Data entry is a tedious process and sometimes people make mistakes.
• The outlier might be in different units than the rest of the data. For instance, in the questionnaires from the first day of class, a few students gave their heights in centimeters rather than inches. If these heights had not been converted, then our class data set would have shown students over 12 feet tall.

Outliers
• Outliers are often real, accurate pieces of data that are simply unusual.
• For instance, most people work 35-40 hours per week. However, a very small number work 70-80 hours a week.
• It is sometimes tempting to remove outliers from a data set, but we must first find out whether the outlier is a legitimate observation or a mistake.

Dispersion (Spread)
Here is a good piece of advice:
“Do not cross a river if it is, on average, 4 feet deep”
-Nassim Taleb, The Black Swan
Why is this good advice? What additional information would we need before we decide if crossing the river is a good idea?

Dispersion (Spread)
• Information about location (average or median) is not enough to adequately summarize a data set.
• Sometimes the average does not describe any actual individual. For example, the average human being has one ovary and one testicle.
• Information about how your data is dispersed is also useful, and is essential in inferential statistics.
• We don’t just want to know where the center of our data lies; we also want to know how spread out the data is!

The Range
• The range is the easiest measure of dispersion to compute.
• It is the difference between the maximum value and the minimum value.
• One problem with using the range is that it doesn’t tell you whether most of the data is spread out through the whole range, or if the maximum and minimum values are outliers.

The IQR
• The inter-quartile range (Q3 - Q1) is not affected by extreme values, since it is calculated using values that lie close to the center of the data set.
• We will not use either the range or the IQR when we move on to inferential statistics. But they are still useful as descriptive statistics.

Variance
• The variance is another measure of dispersion. It is closely related to the standard deviation, which we will consider shortly.
• Unlike the range or IQR, the variance is computed using all of the data values in a data set.
• It is sensitive to outliers, but the effects of extreme values are “diluted” if there are a large number of observations.
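To make the contrast between the range and the IQR concrete, here is a small Python sketch using the 20-observation data set from the median examples. The helper names are invented for this illustration, and the value 1750 is a hypothetical data-entry error (175 typed with an extra zero), not part of the original data.

```python
def median_of_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def data_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)


def iqr(values):
    """Inter-quartile range, Q3 - Q1, with quartiles as medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median_of_sorted(ordered[:half])
    q3 = median_of_sorted(ordered[len(ordered) - half:])
    return q3 - q1


data = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
        105, 110, 116, 116, 117, 121, 137, 142, 151, 175]

# Hypothetical data-entry error: the largest value typed as 1750.
with_error = data[:-1] + [1750]

print(data_range(data), iqr(data))              # 126 43.5
print(data_range(with_error), iqr(with_error))  # 1701 43.5
```

A single extreme value changes the range drastically, while the IQR stays the same because Q1 and Q3 sit near the middle of the ordered data.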

Sum of Squared Deviations
• To compute the variance of a data set we first need a statistic called the sum of squared deviations.
• This is often abbreviated as SS, for “sum of squares”.
• To get the squared deviation for a single observation, subtract the mean from this observation, and then square the result.
• Do this for all observations and sum the results. This gives us the sum of squared deviations.
• Mathematically, $SS = \sum (x_i - \bar{x})^2$

Sum of Squared Deviations
• Example: find the sum of squared deviations (SS) for our snake undulation rate data set:
0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6
With $\bar{x} = 1.375$, we get $SS = \sum (x_i - \bar{x})^2 = 0.735$.

Sample Variance
• The sample variance is denoted by the symbol $s^2$.
• Mathematically, $s^2 = \frac{SS}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• The English interpretation of a variance is: “The average squared distance that a group of ‘n’ points lies from the mean of the group.”
• This is not a very intuitive concept, though it is very often used in mathematical computations.

Sample Standard Deviation
• The sample standard deviation is simply the square root of the sample variance.
• It is denoted by the letter s.
• Continuing with our example, we have: $s = \sqrt{s^2} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{0.735}{7}} \approx 0.324$

Interpret the Standard Deviation
• The standard deviation can be thought of roughly as an average distance that a group of points lies from the group mean.
• A large standard deviation tells you that your data is highly dispersed, or spread out.
• In inferential statistics, a large standard deviation signifies high levels of uncertainty regarding statistical inferences.
• Note that what counts as “large” or “small” depends on the magnitude of the data itself.

Shapes of Distributions
• You don’t always need a histogram to judge the shape of a distribution. Often the values of the mean and the median of your data set are enough to tell whether it is skewed.
[Figure: histogram of Grades (frequency), with Median = 92 and Mean = 86 marked]

Shapes of Distributions
• What is the shape of this distribution?
• Note that the mean is 86, and the median is 92.
[Figure: histogram with Median = 92 and Mean = 86 marked]

Shapes of Distributions
• What is the shape of this distribution?
• Note that the mean is 2.6, and the median is 0.6.
[Figure: histogram with Median = 0.6 and Mean = 2.6 marked]

Shapes of Distributions
• What is the shape of this distribution?
• Note that the mean is 102, and the median is 102.
[Figure: histogram with Mean = 102 and Median = 102 marked]

Mean, Median, & Shape
• If the mean is greater than the median then the distribution is skewed to the right.
• If the mean is less than the median then the distribution is skewed to the left.
• If the mean and median are (approximately) equal then the distribution is (approximately) symmetric.

Conclusion
• A statistic is any number calculated from a set of data. Descriptive statistics are numbers that are used to describe important features of a data set.
• The mean and median are very commonly used statistics which refer to location.
• The standard deviation is a very commonly used statistic which refers to dispersion.
• In the next set of notes, we will look at probability and the normal distribution, which will lay the groundwork for understanding inferential statistics.
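As a closing worked example, the dispersion formulas and the mean/median rule from these notes can be sketched in Python on the snake undulation data. The function names and the `tolerance` parameter are assumptions made for this illustration; the notes do not say how close the mean and median must be to count as “approximately equal”.

```python
from math import sqrt
from statistics import mean, median


def sum_of_squares(values):
    """SS: sum of squared deviations from the sample mean."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values)


def sample_variance(values):
    """s^2 = SS / (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)


def sample_sd(values):
    """s = square root of the sample variance."""
    return sqrt(sample_variance(values))


def skew_direction(values, tolerance=0.0):
    """Mean/median rule; `tolerance` is a hypothetical cutoff for 'equal'."""
    gap = mean(values) - median(values)
    if gap > tolerance:
        return "skewed right"
    if gap < -tolerance:
        return "skewed left"
    return "approximately symmetric"


snake_rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

print(sum_of_squares(snake_rates))   # approximately 0.735
print(sample_variance(snake_rates))  # approximately 0.105
print(sample_sd(snake_rates))        # approximately 0.324
print(skew_direction(snake_rates))   # skewed right (mean 1.375 > median 1.35)
```

With only eight observations, the small gap between the mean and the median is weak evidence of skew; a nonzero `tolerance` would be a reasonable choice in practice.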