North Carolina State University
STAT 370: Probability and Statistics for Engineers [Section 002]
Instructor: Hua Zhou
Harrelson Hall 210
10:15AM-11:30AM, Jan 23, 2012

Announcement
• HW2 (graphical summary) due Friday, Jan 27 @ 6pm
• Any problems using StatCrunch?

Last class
• Graphical summary: bar plot (categorical variable), pie chart (categorical variable), histograms (quantitative variable), scatter plot (two quantitative variables, time series)
• Shape of distribution: symmetric, bell-shaped, left or right-skewed, unimodal, bimodal

Graphical Summary is Evolving!
• Suppose you have collected a data set of life expectancy and income for 200 countries over 200 years.
• How would you summarize this data?
• A video clip: http://www.open.ac.uk/openlearn/science-mathstechnology/mathematics-and-statistics/statistics/the-joy-stats-200countries-200-years-4-minutes

Today
• Numerical summary of data: mean/median/mode, variance, standard deviation, outliers
• Graphical summary: Boxplot

Measuring center: the mean $\bar{x}$
• Mean = average value
• Sample mean $\bar{x}$: for n observations $x_1, x_2, \dots, x_n$, their mean is
  $\bar{x} = (x_1 + x_2 + \dots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i$

Measuring center: the median
• Median = middle value or center point
• Sample median: the number such that half of the observations are smaller than it and the other half are larger; the midpoint of a distribution

Procedure to calculate the median M:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations, n, is odd, the median M is the center observation in the ordered list.
3. If n is even, then M is the average of the two center observations in the ordered list.
Note: (n+1)/2 is the location of the median in the ordered list, not the median itself.

Mean vs Median
• Consider two data sets {5, 8, 9, 10, 13} vs {5, 8, 9, 10, 68}
– Both have a median equal to 9
– The first has a mean of 9, the second has a mean of 20
– The median is more robust to the outlier 68
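The mean and median calculations for the two data sets can be checked with a short sketch. The function names `sample_mean` and `sample_median` are illustrative choices, not part of the course material (the course itself uses StatCrunch); the median function follows the three-step procedure given in this lecture.

```python
def sample_mean(xs):
    """Sample mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Sample median via the ordered-list procedure."""
    ordered = sorted(xs)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # step 2: n odd, take the center observation
        return ordered[mid]
    # step 3: n even, average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


first = [5, 8, 9, 10, 13]
second = [5, 8, 9, 10, 68]

print(sample_mean(first), sample_median(first))    # 9.0 9
print(sample_mean(second), sample_median(second))  # 20.0 9
```

Replacing 13 with the outlier 68 moves the mean from 9 to 20 while the median stays at 9, which is the robustness property described in the slide.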
Mean vs Median: Salary Survey of UNC Graduates
• Survey a certain number of graduates from UNC.
• Many departments are surveyed.
• Question:
– Which department produces students who earn the most on average 10 years after getting their degrees?
• Answer:
– Geography!!!!??????
– Michael Jordan (a UNC geography graduate whose earnings are an extreme outlier that pulls up the department's mean)

Mean vs. Median
• Mean:
– easy to calculate
– easy to work with algebraically
– highly affected by outliers
– not resistant to extreme observations
• Median:
– can be time consuming to calculate
– more resistant to a few extreme observations (sometimes outliers)
– robust

Mode: where is the peak
• Important for categorical data
• The most frequent value in the data
• Possible to have more than one mode

Mean, Median and Mode
• Mean – average
• Median – the middle data point
• Mode – where the peak(s) is (are)
• If a unimodal distribution is exactly symmetric, then mean, median and mode are exactly the same.
• If the distribution is skewed, the three measures differ.

Remarks
– If the histogram is symmetric, the mean is approximately equal to the median.
– If the histogram is right skewed, the mean is likely greater than the median.
– If the histogram is left skewed, the mean is likely less than the median.
– The difference between the mean and the median is a rough measure of how severely skewed the data are.

Which one to use?
• Different by definition
– Mean and median are unique, and are defined only for quantitative variables.
– Mode is not unique.
– Mode is also defined for categorical variables.
• The choice depends on the shape of the distribution, the type of data and the purpose of your study
– Skewed: median
– Categorical: mode
– Total quantity: mean
– …

Exercise
• What's the mode? Answer: 4
• What's the median? Answer: 4
• What's the mean? Answer: approx. 4.215

Measure of Spread: Variance and Standard Deviation
• For a random sample $x_1, x_2, \dots, x_n$ of size n, the sample variance and standard deviation are measures of spread
– Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$
– Standard deviation = s = square root of the variance (in the same units as the data)
– If observed values are far apart, the variance and standard deviation will be large
– Note that the variance (and standard deviation) is greater than or equal to 0, and equals 0 only when all observed values are equal
– Variance and standard deviation are strongly affected by outliers

Standard deviation (cont'd)
• Why sample "st. dev." (SD) rather than sample "variance"?
– S.D. is in the original scale
– S.D. is natural for measuring spread for "normal" distributions
• Why "n-1" rather than "n"?
– Intuitively speaking, S.D. is not defined for n=1
– The sum of deviations is always 0, which means "if we know (n-1) of them, we know the last one"
– Only (n-1) deviations can change freely
– "n-1" is referred to as the degrees of freedom

Example
• Calculate the variance for
a) First data set {5, 8, 9, 10, 13}
   $s^2 = [(5-9)^2 + (8-9)^2 + (9-9)^2 + (10-9)^2 + (13-9)^2] / 4 = 8.5$, s.d. = 2.9155
b) Second data set {5, 8, 9, 10, 68}
   $s^2 = [(5-20)^2 + (8-20)^2 + (9-20)^2 + (10-20)^2 + (68-20)^2] / 4 = (225 + 144 + 121 + 100 + 2304) / 4 = 723.5$, s.d. = 26.8980

Take Home Message
• Graphical tools for quantitative data
– Histograms
– Boxplot
• Examine distributions:
– Shape
– Symmetric or skewed
– How many modes?
– Bell-shaped
– Outliers
• Numerical summary: Mean, median, mode, variance, standard deviation, Q1, Q3, IQR
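As a numerical check on the variance example in this lecture, the sketch below recomputes the sample variance and standard deviation for both data sets using the n-1 divisor. The function names `deviations`, `sample_variance` and `sample_sd` are illustrative assumptions, not names from the course or from StatCrunch.

```python
import math


def deviations(xs):
    """Deviations of each observation from the sample mean."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def sample_variance(xs):
    """Sample variance: sum of squared deviations divided by (n - 1)."""
    n = len(xs)
    if n < 2:
        # With a single observation there are no degrees of freedom left.
        raise ValueError("sample variance needs at least two observations")
    return sum(d ** 2 for d in deviations(xs)) / (n - 1)


def sample_sd(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


first = [5, 8, 9, 10, 13]
second = [5, 8, 9, 10, 68]

print(sample_variance(first), sample_sd(first))    # variance 8.5, s.d. about 2.9155
print(sample_variance(second), sample_sd(second))  # variance 723.5, s.d. about 26.898
print(sum(deviations(first)))                      # deviations sum to 0
```

The last line illustrates the degrees-of-freedom argument: because the deviations always sum to 0, only n-1 of them can vary freely.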