Chapter 1 DISTRIBUTIONS

When describing a distribution, one should, at a minimum, describe the center, spread, shape and outliers. So far we have done this with words. Now it is time to introduce numbers to aid in the description.

The center of a distribution can be described by its mean or median.

The mean or average of a set of observations is found by adding their values and dividing by the number of observations:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

OR

$$\bar{x} = \frac{1}{n}\sum x_i$$

Barry Bonds Homeruns for 1986-2001
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73

Calculate the mean using STAT/CALC/1-Var Stats.

The 73 homeruns may be an outlier. Change the 73 to something more consistent with the other values (e.g., 35). Now recalculate the mean. What happened?

The mean is NOT a resistant measure because it is affected by extreme measurements (outliers). In fact, the mean is drawn toward them.

The median M is the midpoint of a distribution. It is a number such that half the observations are smaller and the other half are larger.

To find the median:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.

Examples:
1 2 3 4 5 6
0 1 2 3 4 5 6

Now change the 6 to 100. Does this change the median?

The median is a resistant measure because it is not affected by extreme observations.

The closer a distribution is to a symmetrical shape, the closer together the values of the median and the mean. The mean and median are identical for perfectly symmetrical distributions.

Since the median is a resistant measure and the mean is not, the mean will be drawn toward extreme observations, that is, toward the long tail of a skewed distribution. For distributions that are skewed to the left, the mean will be less than the median.
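The resistance of the median and the non-resistance of the mean can also be checked outside the calculator. The following is a minimal Python sketch, not part of the original slides; the function names `mean` and `median` and the variable names are illustrative assumptions. It follows the steps given above for each measure and repeats the "change the 73 to 35" experiment on the Bonds data.

```python
def mean(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Midpoint of the ordered list, following the three steps above."""
    ordered = sorted(values)          # step 1: smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd n, take the center observation
        return ordered[mid]
    # step 3: even n, take the mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


# Barry Bonds home run counts as listed above
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]

# Same data with the possible outlier 73 replaced by 35
adjusted = [35 if x == 73 else x for x in bonds]

print(mean(bonds), median(bonds))        # 35.4375 34.0
print(mean(adjusted), median(adjusted))  # 33.0625 34.0
```

Replacing the single extreme value moves the mean by more than two home runs, while the median does not move at all.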
For distributions that are skewed to the right, the mean will be greater than the median.

The spread of a distribution can be described by the interquartile range (IQR) or the standard deviation (s).

The quartiles Q1 and Q3 are calculated as follows:
1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

The INTERQUARTILE RANGE (IQR) is the distance between the first and third quartiles and is calculated as IQR = Q3 - Q1. Roughly the middle 50% of the observations lie between Q1 and Q3.

The IQR can be used to identify outliers. By definition, an outlier is an observation that is
greater than Q3 + 1.5(IQR) or
less than Q1 - 1.5(IQR)

Barry Bonds homeruns in increasing order, with the positions of the quartiles and median marked:
16 19 24 25 [Q1] 25 33 33 34 [M] 34 37 37 40 [Q3] 42 46 49 73

A quick summary/description of the center and spread of a distribution can be given by the 5 number summary. The five number summary is

Minimum Q1 M Q3 Maximum

The 5 number summary is shown graphically in a box and whisker plot, more typically referred to simply as a boxplot. The boxplot can be found under 2nd Y=/plot#/Icon 5.

[Figure: Side-by-side boxplots comparing the number of homeruns per year by Barry Bonds and Hank Aaron]

A modified boxplot is a graph of the 5 number summary, with outliers identified using the IQR.
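The quartile steps and the 1.5(IQR) rule can be sketched in Python as follows. This is an illustration rather than the calculator's own routine; the names `five_number_summary` and `find_outliers` are assumptions made for this example, and the quartile rule is the one described above (the overall median is left out of both halves when n is odd). Other software may use slightly different quartile conventions. The sketch assumes at least two observations.

```python
def median(values):
    """Median of a list: center value, or mean of the two center values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum) using the quartile steps above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                     # left of the median location
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]  # right of it
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def find_outliers(values):
    """Observations above Q3 + 1.5(IQR) or below Q1 - 1.5(IQR)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]

print(five_number_summary(bonds))  # (16, 25.0, 34.0, 41.0, 73)
print(find_outliers(bonds))        # [73]
```

For the Bonds data the IQR is 41 - 25 = 16, so the fences sit at 1 and 65, and only the 73 falls outside them.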
In a modified boxplot:
- The central box still spans the quartiles.
- A line in the box still identifies the median.
- Observations more than 1.5(IQR) outside the box are plotted individually.
- The lines now extend from the box out to the smallest and largest observations THAT ARE NOT OUTLIERS.

To obtain a modified boxplot, go to ICON 4 under StatPlot.

[Figure: Regular (a) and modified (b) boxplots comparing the home run production of Barry Bonds and Hank Aaron]

When using the mean to describe the center of a distribution, the standard deviation is a more appropriate measure of spread than the IQR.

The variance s^2 of a set of observations is the average of the squares of the deviations of the observations from their mean, where the average is taken by dividing by n - 1. In symbols, the variance of n observations becomes

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

OR

$$s^2 = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$

The standard deviation, s, becomes:

$$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$

- s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
- s = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, s > 0. As observations become more spread out about their mean, s gets larger.
- s, like the mean x-bar, is not resistant. Strong skewness or a few outliers can make s very large.

New York Yankee Roger Maris held the single-season home run record from 1961 until 1998. Here are his home run counts for his 10 years in the American League.

15 28 16 39 61 33 23 26 8 13

Describe the distribution.

Compare the following two data sets. The first is the number of cesarean sections performed by 15 male doctors in Switzerland during one year. The second is the number of cesarean sections performed by 10 female doctors in Switzerland during the same year.

Male Doctors
27 50 33 25 86 25 85 31 37 44 20 36 59 34 28

Female Doctors
5 7 10 14 18 19 25 29 31 33

[Figure: Back-to-back stemplot of the number of cesarean sections performed by male and female Swiss doctors]

Which AP Exam is Easier: Calculus AB or Statistics?
The table below gives the distribution of grades earned by students taking the Calculus AB and Statistics AP exams in 2000.

Grade   CalcAB   Stat
5       16.8%     9.8%
4       23.2%    21.5%
3       23.5%    22.4%
2       19.6%    20.5%
1       16.8%    25.8%

The 2 distributions are roughly similar for grades 2, 3, & 4. A larger proportion of Statistics students received a grade of 1 (25.8% vs. 16.8%), and a smaller proportion received a grade of 5 (9.8% vs. 16.8%). This suggests that the Statistics exam is harder. At the very least it indicates that students who take the Statistics exam get poorer grades than students who take the Calculus exam.
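Returning to the variance and standard deviation defined earlier in this chapter, the formulas can be sketched in Python as shown here. This is an illustrative sketch only; the function names `variance` and `std_dev` are assumptions for this example, the divisor is n - 1 as in the definition of s^2, and the data are the Roger Maris home run counts exactly as listed above.

```python
import math


def variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def std_dev(values):
    """Standard deviation s: the square root of the variance."""
    return math.sqrt(variance(values))


# Roger Maris home run counts as listed in this chapter
maris = [15, 28, 16, 39, 61, 33, 23, 26, 8, 13]

# The same data without the record 61-homer season
without_61 = [x for x in maris if x != 61]

print(round(std_dev(maris), 2))
print(round(std_dev(without_61), 2))
```

Comparing the two printed values shows how much a single extreme season inflates s, which is the sense in which s is not resistant.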