Lecture 4
Dustin Lueker

The population distribution for a continuous variable is usually represented by a smooth curve
  ◦ Like a histogram that gets finer and finer
  ◦ Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate
Symmetric distributions
  ◦ Bell-shaped
  ◦ U-shaped
  ◦ Uniform
Not symmetric distributions
  ◦ Left-skewed
  ◦ Right-skewed
  ◦ Skewed

STA 291 Spring 2010 Lecture 4

Center of the data
  ◦ Mean
  ◦ Median
  ◦ Mode
Dispersion of the data (sometimes referred to as spread)
  ◦ Variance, Standard deviation
  ◦ Interquartile range
  ◦ Range

Mean
  ◦ Arithmetic average
Median
  ◦ Midpoint of the observations when they are arranged in order, smallest to largest
Mode
  ◦ Most frequently occurring value

Sample size n
Observations $x_1, x_2, \ldots, x_n$
Sample mean “x-bar” ($\sum$ denotes the sum):
  $\bar{x} = (x_1 + x_2 + \cdots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i$

Population size N
Observations $x_1, x_2, \ldots, x_N$
Population mean “mu” ($\sum$ denotes the sum):
  $\mu = (x_1 + x_2 + \cdots + x_N)/N = \frac{1}{N}\sum_{i=1}^{N} x_i$
Note: This is for a finite population of size N

The mean requires numerical values
  ◦ Only appropriate for quantitative data
  ◦ Does not make sense to compute the mean for nominal variables
  ◦ Can be calculated for ordinal variables, but this does not always make sense
    Be careful when using the mean on ordinal variables
Example: “Weather” (on an ordinal scale)
  Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5
  Mean (average) weather = 2.8
Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale

Center of gravity for the data set
  ◦ Sum of the differences from values above the mean is equal to the sum of the differences from values below the mean

Mean
  ◦ Sum of observations divided by the number of observations
Example
  ◦ {7, 12, 11, 18}
  ◦ Mean = (7 + 12 + 11 + 18)/4 = 48/4 = 12

Highly influenced by outliers
  ◦ Outliers: data points that are far from the rest of the data
Not representative of a typical observation if the distribution of the data is highly skewed
  ◦ Example: Monthly income for five people
    1,000   2,000   3,000   4,000   100,000
    Average monthly income = 110,000/5 = 22,000
    Not representative of a typical observation

Median: measurement that falls in the middle of the ordered sample
When the sample size n is odd, there is a middle value
  ◦ It has the ordered index (n+1)/2
    Ordered index is where that value falls when the sample is listed from smallest to largest
    An index of 2 means the second smallest value
  ◦ Example: 1.7, 4.6, 5.7, 6.1, 8.3
    n = 5, (n+1)/2 = 6/2 = 3, index = 3
    Median = 3rd smallest observation = 5.7

When the sample size n is even, average the two middle values
  ◦ Example: 3, 5, 6, 9
    n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
    Median = midpoint between 2nd and 3rd smallest observations = (5+6)/2 = 5.5

For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
The median usually better describes a “typical value” when the sample distribution is highly skewed
Example
  ◦ Monthly income for five people
    1,000   2,000   3,000   4,000   100,000
  ◦ Median monthly income: 3,000 (the 3rd smallest of the five values)
    Does this better describe a “typical value” in the data set than the mean of 22,000?

Mean - Arithmetic average
  Mean of a sample - $\bar{x}$
  Mean of a population - $\mu$
Median - Midpoint of the observations when they are arranged in increasing order
Mode - Most frequent value
Notation: Subscripted variables
  n = # of units in the sample
  N = # of units in the population
  x = Variable to be measured
  $x_i$ = Measurement of the ith unit
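
The mean and median examples above can be checked with a short Python sketch. This is only an illustration, not part of the original slides: the helper name `ordered_index_median` is hypothetical, and it simply follows the (n+1)/2 ordered-index rule described above. The standard-library `statistics` module is used as a cross-check.

```python
from statistics import mean, median

def ordered_index_median(values):
    """Median via the 1-based ordered index (n + 1) / 2 from the slides.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(values)
    n = len(ordered)
    index = (n + 1) / 2                      # whole number if n is odd, ends in .5 if n is even
    lower = ordered[int(index) - 1]          # observation at the index rounded down
    upper = ordered[int(index + 0.5) - 1]    # observation at the index rounded up
    return (lower + upper) / 2               # same value twice when n is odd

# Data sets taken from the slides
small_sample = [7, 12, 11, 18]
incomes = [1_000, 2_000, 3_000, 4_000, 100_000]
odd_sample = [1.7, 4.6, 5.7, 6.1, 8.3]
even_sample = [3, 5, 6, 9]

print("mean of small sample:", mean(small_sample))
print("mean income:", mean(incomes))
print("median income:", ordered_index_median(incomes))
print("median, odd n:", ordered_index_median(odd_sample))
print("median, even n:", ordered_index_median(even_sample))

# The single outlier (100,000) pulls the mean far above the median.
print("mean - median for incomes:", mean(incomes) - median(incomes))
```

Replacing the outlier 100,000 with 5,000 leaves the median at 3,000 but drops the mean to 3,000 as well, which is one way to see how strongly the mean reacts to a single extreme value.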
Example: Highest Degree Completed

  Highest Degree                                          Frequency   Percentage
  Not a high school graduate                                 38,012         21.4
  High school only                                           65,291         36.8
  Some college, no degree                                    33,191         18.7
  Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
  Total                                                     177,618          100

n = 177,618
(n+1)/2 = 88,809.5
Median = midpoint between the 88,809th smallest and 88,810th smallest observations
  ◦ Both are in the category “High school only”
Mean wouldn’t make sense here since the variable is only ordinal
Median
  ◦ Can be used for interval data and for ordinal data
  ◦ Cannot be used for nominal data because the observations cannot be ordered on a scale

Mean
  ◦ Interval data with an approximately symmetric distribution
Median
  ◦ Interval data
  ◦ Ordinal data
Mean is sensitive to outliers, median is not

Symmetric distribution
  ◦ Mean = Median
Skewed distribution
  ◦ Mean lies more toward the direction in which the distribution is skewed

Disadvantage of the median
  ◦ Insensitive to changes within the lower or upper half of the data
  ◦ Example: 1, 2, 3, 4, 5 and 1, 2, 3, 100, 100 have the same median (3)
  ◦ Sometimes, the mean is more informative even when the distribution is skewed

Keeneland Sales

The deviation of the ith observation $x_i$ from the sample mean $\bar{x}$ is the difference between them, $(x_i - \bar{x})$
  ◦ Sum of all deviations is zero
  ◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation

Variance of n observations is the sum of the squared deviations, divided by n-1
  $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$

  Observation   Mean   Deviation   Squared Deviation
            1      5          -4                  16
            3      5          -2                   4
            4      5          -1                   1
            7      5           2                   4
           10      5           5                  25
  Sum of the Squared Deviations             50
  n-1                                        4
  Sum of the Squared Deviations / (n-1)   12.5

Variance: about the average of the squared deviations
  ◦ “average squared distance from the mean”
Unit
  ◦ Square of the unit for the original data
    Difficult to interpret
  ◦ Solution
    Take the square root of the variance, and the unit is the same as for the original data
    Standard Deviation

s ≥ 0
  ◦ s = 0 only when all observations are the same
If data is collected for the whole population instead of a sample, then n-1 is replaced by n
s is sensitive to outliers

Sample
  ◦ Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
  ◦ Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
Population
  ◦ Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
  ◦ Standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
  ◦ They are unknown constants that we would like to estimate
Sample mean and sample standard deviation are denoted by $\bar{x}$ and s
  ◦ They are random variables, because their values vary according to the random sample that has been selected

If the data is approximately symmetric and bell-shaped, then
  ◦ About 68% of the observations are within one standard deviation of the mean
  ◦ About 95% of the observations are within two standard deviations of the mean
  ◦ About 99.7% of the observations are within three standard deviations of the mean

Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150
  ◦ About 68% of the scores are between 850 and 1150 (1000 ± 150)
  ◦ About 95% of the scores are between 700 and 1300 (1000 ± 2·150)
  ◦ If you have a score above 1300, you are in the top 2.5% (half of the 5% outside two standard deviations)
    What percentile would this be? About the 97.5th
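
The variance table and the standardized-test example above can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions rather than course material: the helper `empirical_interval` is a hypothetical name, and the final comparison assumes the scores follow an exact normal distribution, which the slides only describe as "bell-shaped".

```python
from math import sqrt
from statistics import NormalDist, stdev, variance

# Observations from the variance table
observations = [1, 3, 4, 7, 10]

n = len(observations)
x_bar = sum(observations) / n                      # sample mean
deviations = [x - x_bar for x in observations]     # x_i - x_bar
squared_deviations = [d ** 2 for d in deviations]  # (x_i - x_bar)^2

s2 = sum(squared_deviations) / (n - 1)             # sample variance
s = sqrt(s2)                                       # sample standard deviation

print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared_deviations))
print("sample variance:", s2)
print("sample standard deviation:", s)

# Cross-check against the standard library (both divide by n - 1)
print("statistics.variance:", variance(observations))
print("statistics.stdev:", stdev(observations))

def empirical_interval(mean, sd, k):
    """Interval mean ± k standard deviations (hypothetical helper)."""
    return (mean - k * sd, mean + k * sd)

# Standardized test: mean 1000, standard deviation 150
within_one = empirical_interval(1000, 150, 1)      # about 68% of scores
within_two = empirical_interval(1000, 150, 2)      # about 95% of scores
top_share = (100 - 95) / 2                         # percent above mean + 2 sd, by the rule

print("about 68% between:", within_one)
print("about 95% between:", within_two)
print("percent above 1300 (empirical rule):", top_share)

# Assumption: an exactly normal model, for comparison with the rule of thumb
normal_tail = 1 - NormalDist(mu=1000, sigma=150).cdf(1300)
print("share above 1300 under a normal model:", normal_tail)
```

The empirical rule gives 2.5% above 1300, while the exact normal tail beyond two standard deviations is slightly smaller (about 2.3%); the rule is a rounded approximation.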