Chapter 3

Measures of the Middle
1. mean (or average) - is computed by adding all the data points and dividing by the total number of data points.
   - common notation for the mean is $\bar{x}$
   - other notation is $\bar{x} = \frac{1}{n} \sum x_i$
   - is not a resistant measure of center (it is influenced by extreme observations)
2. median - is found by ordering all observations from smallest to largest and picking the observation in the middle; if two observations are in the middle, then average the two to find the median
   - notation is M for median
   - is a resistant measure of center
   - is the value where 50% of the data is smaller and 50% of the data is larger
3. mode - is the observation that occurs most often
   - on a graph it would have the largest “bar”
   - a set of data could be bimodal (have two modes) or multimodal
4. midrange - the value in the middle of the range, (minimum + maximum) / 2

Comparing the mean and the median
- Remember they are computed differently and thus represent different things.
- If a distribution is completely symmetric, then the two will be the same.
- In skewed distributions the mean is further along the tail than the median.
- In skewed distributions it is better to use the median as a measure of the center.

Measures of Spread
1. range - is found by subtracting the smallest observation from the largest; it is not an interval, it is a single numerical value
2. standard deviation - is the measure of spread around the mean
   - notation is S for the standard deviation of a sample and $\sigma$ for the standard deviation of an entire population
   - for a sample, is computed by $S = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$ where n-1 is the degrees of freedom
   - for a population, is computed by $\sigma = \sqrt{\frac{1}{N} \sum (x_i - \bar{x})^2}$ where N represents the total number of observations in the population and $\bar{x}$ is the mean of all N observations (often written $\mu$)
3. variance - is the square of the standard deviation
   - the variance of a sample is $S^2$ and the variance of a population is $\sigma^2$
4. interquartile range - is a measure of the spread of the middle 50% of the data
   - is calculated by subtracting the first quartile from the third, IQR = Q3 - Q1
   - Q1 is the median of all observations to the left of Q2 (the median, M)
   - Q3 is the median of all observations to the right of Q2

Five Number Summary
Minimum   Q1   M   Q3   Maximum

Boxplot
- is a graph of the five number summary
- Steps
  1. use only one axis (draw it with a scale)
  2. one end of the box is Q1, the other is Q3
  3. place the minimum and maximum on the graph and draw lines connecting them to the box
  4. draw a line through the box representing the median

Note: You can also draw a boxplot with outliers represented as dots.
- An outlier is any point more than 1.5 × IQR beyond the quartiles.
- Thus, any number outside of the range (Q1 - 1.5 × IQR, Q3 + 1.5 × IQR) is an outlier.
- The minimum point on your boxplot would then be the next number bigger than the low outlier. The maximum point would be the next number smaller than the high outlier.

What can go wrong?
1. Not making a reality check. Make sure your calculated summaries make sense.
2. Forgetting to sort the values before calculating the median or percentiles.
3. Computing numerical summaries for qualitative variables.
4. Not taking into account multimodal situations. If the data has gaps in it, then you may want to discuss each cluster separately and not give only one measure of center and spread.
5. Not scrutinizing outliers. If you have an outlier, check it out. It could be an error. If it is, toss it. If it’s not, use resistant measures of center and spread (median and IQR).
6. Not checking the picture (graph). If the data has outliers or is heavily skewed, then maybe the mean and standard deviation aren’t the best measures to use.
7. Not taking into account their spreads when comparing different groups.
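
The measures of center and spread described in these notes can be checked by hand against a short program. The following is a minimal sketch in Python using only the standard `statistics` module. The data set `SAMPLE` and the helper names (`center_and_spread`, `five_number_summary`, `outliers`, `whisker_ends`) are made up for illustration and are not part of the notes. The quartiles follow the rule stated above (Q1 and Q3 are the medians of the observations to the left and right of the median), which can differ slightly from the quartile rules used by calculators and software packages.

```python
import statistics

# Hypothetical sample, deliberately unsorted and with one extreme value (40).
SAMPLE = [12, 4, 40, 7, 2, 9, 4, 10, 5]


def center_and_spread(values):
    """Measures of the middle and of spread, as defined in the notes."""
    return {
        "mean": statistics.mean(values),            # sum of x_i divided by n
        "median": statistics.median(values),        # sorts internally
        "modes": statistics.multimode(values),      # list, so bimodal data is visible
        "midrange": (min(values) + max(values)) / 2,
        "range": max(values) - min(values),         # a single number, not an interval
        "sample_sd": statistics.stdev(values),      # divides by n - 1
        "population_sd": statistics.pstdev(values), # divides by N
        "sample_variance": statistics.variance(values),  # equals sample_sd squared
    }


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum); assumes at least two observations."""
    xs = sorted(values)                 # always sort before taking medians
    n = len(xs)
    half = n // 2
    lower = xs[:half]                   # observations to the left of the median
    # For odd n the median itself is excluded from both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])


def outliers(values):
    """Points outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR), in sorted order."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in sorted(values) if x < low or x > high]


def whisker_ends(values):
    """Smallest and largest observations that are not outliers."""
    flagged = set(outliers(values))
    kept = [x for x in values if x not in flagged]
    return min(kept), max(kept)


if __name__ == "__main__":
    for name, value in center_and_spread(SAMPLE).items():
        print(f"{name}: {value}")
    print("five-number summary:", five_number_summary(SAMPLE))
    print("outliers:", outliers(SAMPLE))
    print("whisker ends:", whisker_ends(SAMPLE))
```

For this made-up sample the five-number summary is 2, 4, 7, 11, 40, so IQR = 7 and the outlier fences are -6.5 and 21.5. The value 40 is flagged as an outlier, and the upper whisker of the boxplot would end at 12. The mean (about 10.3) is pulled toward the extreme value while the median stays at 7, which illustrates why the median and IQR are preferred for skewed data.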