2.1 – Overview

►►►Read the book, complete the notes, and read the chapter problem and all examples in the chapter.

In this chapter we present a variety of basic tools that will help us understand collected data. We will describe, explore, and compare data sets.

Two General Divisions of Statistics
• Descriptive: to summarize or describe characteristics of a set of data pictorially, numerically, or by tabulation.
• Inferential: to use sample data to make generalizations and/or predictions about a population.

Examples of Descriptive Statistics
1) The average SAT score for a certain college is 513.5.
2) The final exam grades for my statistics class in Fall 2003 ranged from 23% to 99%.

Examples of Statistical Inference
We might infer from appropriate samples that:
1) Between 20% and 25% of American college students are married.
2) High cholesterol levels are associated with increased risk of heart disease.

The same number may be used either to describe a smaller distribution or to make inferences about a larger distribution:
1) Nielsen reports that 24.7% of those who were interviewed watched the President’s news conference last Sunday night.
2) Probably about 24.7% of all television viewers watched the President’s news conference last Sunday night.
3) The average age of students enrolled in this class is 19.7 years.
4) The average age of students enrolled at this college is probably 19.7 years.

Important Characteristics of Data (CVDOT)
1) Center: A representative or “average” value that indicates where the middle of the data set is located.
2) Variation: A measure of how spread out the data values are.
3) Distribution: The shape of the spread of the data. A distribution could be bell-shaped, uniform, skewed, etc.
4) Outliers: Sample values that lie very far away from the vast majority of the other sample values (possibly due to errors or unusual circumstances).
5) Time: Changing characteristics of the data over time.

2.2 – Frequency Distributions

We’ll use the table below to introduce the vocabulary.
• Frequency distribution, classes, frequency
• Advantage of a frequency distribution: it makes a list more intelligible
• Disadvantage of a frequency distribution: the original data are lost
• Lower and upper Class Limits
• Class Boundaries: $\dfrac{\text{upper limit of one class} + \text{lower limit of the next class}}{2}$
• Class Midpoints: within a class, $\dfrac{\text{lower class limit} + \text{upper class limit}}{2}$
• Class Width: the difference between two consecutive lower class limits
• Relative Frequency: a percentage or fraction, $\dfrac{\text{class frequency}}{\text{sum of frequencies}}$, usually given as a %
• Cumulative Frequency: the sum of the frequencies at and below a given class

►►►Frequency Distribution of the Systolic Blood Pressure of Women (#2, page 44)

Systolic Blood Pressure of Women    Frequency    Relative Frequency
80-99                                   9
100-119                                24
120-139                                 5
140-159                                 1
160-179                                 0
180-199                                 1

a) List the lower class limits
b) List the upper class limits
c) List the class boundaries
d) List the class midpoints
e) What is the class width?
f) Construct the relative frequency distribution (complete the column in the table above)
g) Construct the cumulative frequency distribution

Systolic Blood Pressure of Women    Cumulative Frequency
Less than
Less than
Less than
Less than
Less than
Less than

h) What is the number of women in the sample?

Guidelines for Constructing a Frequency Distribution
1. Be sure that classes are mutually exclusive.
2. Include all in-between classes, even if the frequency is zero.
3. Use the same width for all classes. Sometimes open-ended intervals are impossible to avoid for the first and/or last class.
4. Use between 5 and 20 classes. Usually 10 or fewer.
5. The sum of the class frequencies must equal the number of original data values.
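The vocabulary of this section can be checked with a short Python sketch. It is only an illustration, not part of the course materials: the variable names are our own, and it assumes the class limits and frequencies from the systolic blood pressure table above. The outermost boundaries are obtained by extending the first and last class by half the gap between classes, which is an assumption the definition above does not state explicitly.

```python
# Class limits and frequencies from the systolic blood pressure table.
lower = [80, 100, 120, 140, 160, 180]
upper = [99, 119, 139, 159, 179, 199]
freq = [9, 24, 5, 1, 0, 1]

# Class boundaries: halfway between an upper limit and the next lower limit.
# Assumption: the two outermost boundaries use the same half-gap.
gap = lower[1] - upper[0]
boundaries = [lower[0] - gap / 2] + [u + gap / 2 for u in upper]

# Class midpoints: (lower limit + upper limit) / 2 within each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]

# Class width: difference between two consecutive lower class limits.
width = lower[1] - lower[0]

# Relative frequency (as a %) and cumulative frequency.
n = sum(freq)
relative = [100 * f / n for f in freq]
cumulative = []
running = 0
for f in freq:
    running += f
    cumulative.append(running)

print("boundaries:", boundaries)
print("midpoints: ", midpoints)
print("width:     ", width)
print("relative %:", relative)
print("cumulative:", cumulative)
```

The last cumulative frequency equals the sample size, which is a quick way to check guideline 5.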
Constructing a Frequency Distribution

►►►Example: Days to Maturity for Short-Term Investments
The following table displays the number of days to maturity for 40 short-term investments. The data are from Barron’s National Business and Financial Weekly.

70 62 75 57 51 64 38 56 53 36
99 67 71 47 63 55 70 51 50 66
64 60 99 55 85 89 69 68 81 79
87 78 95 80 83 65 39 86 98 70

STEPS:
(1) Decide on the number of classes (5 to 20). In this example we’ll use 7 classes.
(2) Find the class width, ROUNDED UP to a convenient number:
$\text{class width} = \dfrac{\text{highest value} - \text{lowest value}}{\text{number of classes}}$
(3) Select the starting point or lower limit of the first class (the lowest score or a convenient value lower than the lowest score). In this example we’ll use the lowest score as the lower limit of the first class.
(4) Find all other lower limits by adding the class width, and display them vertically.
(5) Find the upper class limits.
(6) Tally each score in the appropriate class and find the total frequency for each class.

Classes    Tally    Frequencies    Relative Frequencies    Cumulative Frequencies

►►►Now do each of the following:
a) List the class boundaries
b) List the class midpoints
c) Construct the relative frequency distribution (4th column above)
d) Construct the cumulative frequency distribution (last column above)

Interpreting Frequency Distributions
Using frequency distributions to describe, explore, and compare data sets.
►►►Read the 3 examples on pages 42 and 43, and refer to tables 2-4 to 2-7 on page 43.

2.3 – Visualizing Data

Histograms
• Horizontal axis: values of the data. Use class boundaries for marks along the horizontal axis.
• Vertical axis: frequencies. (The vertical height of the histogram should be about three-fourths of the total width.)
• The height of a bar represents the frequency of each class.
• Both axes should be clearly labeled.
• Note: We cannot reconstruct the original data set from a histogram; we have sacrificed some accuracy for convenience in displaying the data.
• Remember to interpret the histogram by referring to the characteristics of data (CVDOT) from section 2.1.

►►►Construct a histogram for the “Days to Maturity” data from section 2.2, page 4 of notes. Here is the frequency table constructed on page 5 of notes:

Days to maturity    Number of Investments
36-45                    3
46-55                    7
56-65                    8
66-75                    9
76-85                    6
86-95                    4
96-105                   3

Relative Frequency Histogram
• Use the relative frequencies along the vertical axis.
• The shape should be the same as the regular histogram, with the vertical axis labeled differently.
►►►Construct a relative frequency histogram for the “Days to Maturity” data.

Using the TI-83 to Sketch a Histogram

(1) Using Raw Data
►►►We’ll use the raw data “Days to Maturity” from section 2.2, page 4 of notes.

70 62 75 57 51 64 38 56 53 36
99 67 71 47 63 55 70 51 50 66
64 60 99 55 85 89 69 68 81 79
87 78 95 80 83 65 39 86 98 70

• Step 1: Enter the data in the calculator into one of the L’s.
SINCE WE’LL BE USING THE DATA FOR DIFFERENT PROBLEMS WE ARE GOING TO CREATE A NEW LIST.
Arrow up and to the right of L6, type the name of your list (call it DAYS), press ENTER, and type in the data.
• Step 2: Clear or inactivate any functions defined in Y=
• Step 3: Define the statistical plot desired:
2nd Y= [STAT PLOT]
1: Plot1
ENTER to turn the plot on
Arrow down and right to the histogram, ENTER to select it
Xlist: L1 (or other L used)
To select the created list DAYS, do the following: 2nd STAT (to access the LIST menu), then select DAYS by pressing ENTER next to the name
Frequency: 1 (to change to 1, do ALPHA 1)
• Step 4: Let the calculator select a window by pressing ZOOM 9.
Press TRACE and arrow to the right to read the classes and frequencies.
The classes are probably different from the ones produced in the notes page 6.
To produce the same classes as your frequency table, do the following after step 3:
• Step 5: Select a window by pressing WINDOW
Xmin = lower limit of the first class
Xmax = lower limit of the next class beyond the data (Xmin + number of classes * class width)
Xscl = class width
Ymin = 0, which is the lowest possible frequency (or –10 to provide space for writing as we trace)
Ymax = larger than the highest frequency
• Step 6: Press GRAPH. To display information on the screen, press TRACE and arrow to the right.
• Step 7: 2nd QUIT will return to the Home Screen.
• Step 8: To turn the stat plot off: 2nd Y= [STAT PLOT], ENTER on Plot1, ENTER on Off
or: Y=, up to Plot1, ENTER to deselect

(2) Using Grouped Data
►►►Use the frequency table for the data “Days to Maturity”.

Days to maturity    Class Midpoints    Number of Investments
36-45                                       3
46-55                                       7
56-65                                       8
66-75                                       9
76-85                                       6
86-95                                       4
96-105                                      3

• Step 1: Clear Lists 2 and 3 (or any other lists)
• Step 2: Enter the class midpoints in L2.
• Step 3: Enter the class frequencies in L3.
• Step 4: Clear or inactivate any functions defined in Y=
• Step 5: Define the statistical plot desired:
2nd Y= [STAT PLOT]
1: Plot1
ENTER to turn the plot on
Arrow down and right to the histogram, ENTER to select it
Xlist: L2 (or other L being used)
Frequency: L3 (or other L being used)
• Step 6: Select a window by pressing WINDOW
Xmin = lower limit of the first class
Xmax = lower limit of the next class beyond the data (Xmin + number of classes * class width)
Xscl = class width
Ymin = –10 to provide space when tracing
Ymax = larger than the highest frequency
• Step 7: GRAPH. To display information on the screen, press TRACE and arrow to the right. Make sure the information agrees with the frequency table.
• Step 8: 2nd QUIT will return to the Home Screen.
• Step 9: To turn the stat plot off: 2nd Y= [STAT PLOT], ENTER on Plot1, ENTER on Off
or: Y=, up to Plot1, ENTER to deselect
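For readers without a TI-83, the tally behind the histogram can be sketched in Python. This is an illustration under stated assumptions rather than the method of the notes: the class width of 10 is entered by hand (the computed value 9 "rounded up to a convenient number"), and the names `freq`, `xmin`, `xmax`, and `xscl` are our own, chosen to mirror the WINDOW settings above.

```python
# Days to Maturity data from section 2.2 (40 short-term investments).
days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

num_classes = 7
raw_width = (max(days) - min(days)) / num_classes  # (99 - 36) / 7 = 9.0
width = 10         # assumption: 9 rounded up to a convenient number
start = min(days)  # lowest score used as the first lower class limit

# Tally each value into its class (integer data, so floor division works).
freq = [0] * num_classes
for d in days:
    freq[(d - start) // width] += 1

# Text histogram: one row per class, one star per investment.
for i, f in enumerate(freq):
    lo = start + i * width
    hi = lo + width - 1
    print(f"{lo:3d}-{hi:<3d} {'*' * f} ({f})")

# The matching TI-83 WINDOW settings described above.
xmin = start
xmax = start + num_classes * width
xscl = width
print("Xmin =", xmin, " Xmax =", xmax, " Xscl =", xscl)
```

The tallies agree with the frequency table used throughout this section (3, 7, 8, 9, 6, 4, 3).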
Frequency Polygon Construction
• Plot the points with coordinates (class midpoint, class frequency).
• Connect the points with line segments.
• Extend the first and the last segments to the left and right so that the graph begins and ends on the x-axis.
►►►Construct a frequency polygon for the “Days to Maturity” data.

Days to Maturity    Midpoints    Frequency
36-45                                3
46-55                                7
56-65                                8
66-75                                9
76-85                                6
86-95                                4
96-105                               3

Ogive Construction
• Plot the points with coordinates (upper class boundary, cumulative frequency).
• Connect the points with line segments.
• The graph begins on the x-axis at the lower boundary of the first class and ends at the upper boundary of the last class (it must start at 0% and end at 100%).
• Ogives are useful for determining the number of values below some particular value.
►►►Construct an ogive for the “Days to Maturity” data.

Upper Class Boundaries    Cumulative Frequency

Dot Plots (see page 48, figure 2-5)
• Each data value is plotted as a point (or dot) along a scale of values.
• Numbers appear individually, not in categories as in a histogram.
• Stack the values vertically when values occur more than once.
• Similar to histograms because we can see the distribution of the data.
• We do not lose the particular values.
►►►Construct a dot plot for the “Days to Maturity” data.

Stem and Leaf Plots (see page 49)
• Similar to histograms because we can see the distribution of the data
• We do not lose the particular data values
• STEM (consists of the leftmost digit(s))
• LEAF (consists of the rightmost digit)
• Examine the plot sideways and see a histogram
• The number of stems should be kept between 5 and 20
• If there are too many values, expand by subdividing rows into digits from 0 to 4 and digits from 5 to 9
• If necessary, condense, that is, reduce the number of rows
• Since it displays the data in order, it is a fast and easy procedure for ranking data (arranging data in order)
►►►Construct a stem-and-leaf plot for the “Days to Maturity” data.

Pareto Charts (see page 51)
• A bar graph for qualitative data
• Bars are arranged in descending size
• The vertical scale can represent frequencies or relative frequencies, as in the histogram
►►►Do problem #21, on page 57

Pie Charts (see page 51)
• Used to display qualitative data in a more understandable way, so that we see what part of the total data is represented by each category.
• Make a table with a column of relative frequencies (%) and a column of degrees (% of 360).
►►►Do problem #23, on page 58

Scatter Diagrams (see page 51)
• A plot of paired (x, y) data with a horizontal x-axis and a vertical y-axis.
• The pattern of the plotted points is often helpful in determining the presence and form of some relationship between the two variables.
►►►We’ll do scatter diagrams in more detail later in chapter 9.

Time-Series Graph (see figure 2-8, page 52)
• Time-series data are data that have been collected at different points in time.
►►►Do problem #27, page 58
Enter YEAR in L1, enter STOCK VALUE in L2, and use the 2nd graph in the STAT PLOT menu to graph.
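The stem-and-leaf construction described earlier in this section can be sketched in Python. This is a minimal illustration, assuming two-digit data so that the stem is the tens digit and the leaf is the units digit; the variable names are our own.

```python
from collections import defaultdict

# Days to Maturity data from section 2.2.
days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

# Sorting first means the leaves in every row come out in order,
# which is why the plot doubles as a ranking of the data.
stems = defaultdict(list)
for value in sorted(days):
    stem, leaf = divmod(value, 10)  # stem = tens digit, leaf = units digit
    stems[stem].append(leaf)

# Print every stem between the smallest and largest, even if it is empty.
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Turned on its side, the printed rows have the same shape as a histogram with classes 30-39, 40-49, and so on.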
Other Graphs
►►►Discuss the graphs on pages 53 and 54.

2.4 – Measures of Center

Measure of Center: a value representing some type of measure of the center or middle of a data set.

Mean (arithmetic mean)
• The sum of the scores divided by the number of scores
• The mean is affected by low or high values
Notation:
n: number of values in a data sample
N: number of values in a data population
Sample mean: $\bar{x} = \dfrac{\sum x}{n}$
Population mean: $\mu = \dfrac{\sum x}{N}$
►►►Do problem #3, on page 69

Median
• The middle value when the data are arranged in ORDER
• If n is odd, the median is located exactly in the middle
• If n is even, it is the mean of the two middle numbers
• The median is resistant to (not affected by) very large or very small data values (it is “robust”)
►►►Do problem #3, on page 69

Mode
• The most frequent score or class
• Sometimes a data set can be bimodal, multimodal, or have no mode
►►►Do problem #3, on page 69

Midrange
• The value midway between the highest and lowest data values:
$\text{midrange} = \dfrac{\text{highest value} + \text{lowest value}}{2}$
►►►Do problem #3, on page 69

Round-off Rule
• Carry one more decimal place than is present in the original set of data values.
• Round off only the final answer. Keep several more decimal places during intermediate calculations.

Computing the Mean on the TI-83
Non-grouped data or RAW data
Step 1: Clear List 1
Step 2: Enter the data in L1.
Step 3: 2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L1 ENTER
Note: If the data are in another L, you must specify which L. Otherwise the calculator always assumes L1.
►►►Find the mean of the DAYS TO MATURITY DATA that you have in your calculator in the list named DAYS.

Mean From a Frequency Distribution
To find the mean of data summarized in a frequency distribution we use the formula
$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f}$,
where f denotes the frequency and x represents the class midpoint.
►►►Example: Use the frequency distribution from the “Days to Maturity” example to find the mean of the days to maturity of those 40 short-term investments.
Days to maturity    Class Midpoints x    Number of Investments f
36-45                                         3
46-55                                         7
56-65                                         8
66-75                                         9
76-85                                         6
86-95                                         4
96-105                                        3

Computing the Mean on the TI-83
Grouped Data
Step 1: Clear Lists 5 and 6
Step 2: Enter the class midpoints in L5
Step 3: Enter the class frequencies in L6.
Step 4: 2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L5, L6 ENTER
►►►Use the calculator to find the mean for the grouped data of the “Days to Maturity” example. Now compare with the answer that you got on page 14 when using raw data. Explain the differences.

Weighted Mean
The mean computed with the different scores assigned different weights:
$\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$
►►►Find the grade-point average of a student who took five classes and got the following grades: A (in a 3-credit course), B (in a 5-credit course), A (in a 2-credit course), C (in a 4-credit course), and A (in a 3-credit course). Remember that A = 4, B = 3, C = 2, D = 1, F = 0.

The Best Measure of Central Tendency
See table 2-10 on page 67.

Skewness (see page 68)
A skewed distribution is a non-symmetric distribution that extends more to one side than the other.
• Skewed to the left (negatively skewed, lopsided to the right): The histogram is much lower on the left side, and the mean is left of the median, which is left of the mode.
• Skewed to the right (positively skewed, lopsided to the left): The histogram is much lower on the right side, and the mean is right of the median, which is right of the mode.
• Symmetric (zero skewness, data not lopsided): The histogram is a mirror image about the data center, and the mean = the median = the mode.

2.5 – Measures of Variation (Dispersion)

►►►Refer to problem #9 on page 70. Complete the table given below and construct a dot plot for each of the distributions (data shown below).
The waiting times of customers (in minutes) are listed below for both banks.
Bank J (single waiting line):      6.5  6.6  6.7  6.8  7.1  7.3  7.4  7.7  7.7  7.7
Bank P (multiple waiting lines):   4.2  5.4  5.8  6.2  6.7  7.7  7.7  8.5  9.3  10.0

►►►Complete the following table:

          Mean    Median    Mode
Bank J
Bank P

►►►Show the dot plot for both banks here:

Range
Range = (Highest value) – (Lowest value)
It is affected by only 2 numbers (it does not represent the whole data set).
►►►Find the range of the two distributions shown above.

Standard Deviation
A measure of the average variation of values about the mean.
►►►In your opinion, considering the two distributions given in problem #9, page 70, which data set has the smaller standard deviation?

Variance
The square of the standard deviation.

Notation
s = sample standard deviation
$s^2$ = sample variance
$\sigma$ = population standard deviation
$\sigma^2$ = population variance

Formulas to find the Standard Deviation
Defining formula (sample): $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Defining formula (population): $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
Shortcut formula (sample): $s = \sqrt{\dfrac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$

Finding the Standard Deviation
►►►Find the standard deviation for the waiting times at the Jefferson Valley Bank (problem #9, page 70).
►►►Follow the same procedure to verify that for the Bank of Providence, s = 1.82 min. Display your work in a table:

Grouped Data: Finding the Standard Deviation from a Frequency Distribution
When dealing with grouped data we will use the calculator to find the standard deviation instead of using the formula given on page 80.

Computing the Standard Deviation on the TI-83
Non-grouped data
Step 1: Clear a list (L1)
Step 2: Enter the data into the list (L1)
Step 3: 2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L1 ENTER
►►►
a) Check the answer obtained for the waiting times at the Jefferson Valley Bank. Enter the data in L1.
b) Check the answer obtained for the waiting times at the Providence Bank. Enter the data in L2.
c) Find the standard deviation for the Days to Maturity data, which is stored in the list DAYS.

Grouped Frequency Table Values
Step 1: Clear Lists 2 and 3
Step 2: Enter the class midpoints in L2
Step 3: Enter the class frequencies in L3.
Step 4: 2nd QUIT to go to the home screen
STAT CALC 1:1-Var Stats L2, L3 ENTER
►►►Find the standard deviation for the grouped data of the Days to Maturity example (refer to page 16 of notes).

Comparing Variation in Different Populations
The coefficient of variation (or CV) for a set of sample or population data, expressed as a percent, describes the standard deviation relative to the mean.
• It is a measure of the importance of the data set’s variation.
Sample: $CV = \dfrac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \dfrac{\sigma}{\mu} \cdot 100\%$
►►►Read the example on page 80.
►►►Find the coefficient of variation for
a) Days to Maturity data
b) Waiting times at the Jefferson Bank
c) Waiting times at the Providence Bank

Interpreting and Understanding Standard Deviation
• It measures the average variation among the scores of a data set.
• A data set with many scores close together yields a small standard deviation.
• A data set with scores spread farther apart yields a larger standard deviation.
• It is a kind of yardstick by which we can compare one set of data with another.
• Range rule of thumb: range ≈ 4 standard deviations, so s ≈ range / 4.
• Values that are within 2 standard deviations of the mean are considered “usual” values.
• Minimum “usual” value ≈ mean – 2 standard deviations
• Maximum “usual” value ≈ mean + 2 standard deviations
• Most of the data are within the interval [(mean – 2 standard deviations), (mean + 2 standard deviations)].
• Values that are more than two standard deviations away from the mean are considered “unusual” values.
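The standard deviation formulas, the coefficient of variation, and the range rule of thumb can be compared side by side in a short Python sketch. It is an illustration only: the function names `sd_defining`, `sd_shortcut`, and `summarize` are our own, and the data are the two sets of bank waiting times from problem #9.

```python
import math
import statistics

# Waiting times (minutes) for the two banks in problem #9.
bank_j = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank_p = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def sd_defining(xs):
    """Sample standard deviation from the defining formula."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """Sample standard deviation from the shortcut formula."""
    n = len(xs)
    return math.sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1)))

def summarize(name, xs):
    mean = sum(xs) / len(xs)
    s = sd_defining(xs)
    cv = s / mean * 100                   # coefficient of variation, in %
    rough_s = (max(xs) - min(xs)) / 4     # range rule of thumb
    usual = (mean - 2 * s, mean + 2 * s)  # interval of "usual" values
    print(f"{name}: mean={mean:.2f}, s={s:.2f}, CV={cv:.1f}%, "
          f"range/4={rough_s:.2f}, usual=[{usual[0]:.2f}, {usual[1]:.2f}]")
    return mean, s, cv, rough_s, usual

j = summarize("Bank J", bank_j)
p = summarize("Bank P", bank_p)

# The library routine uses the same n - 1 divisor as the defining formula.
print("statistics.stdev, Bank P:", round(statistics.stdev(bank_p), 2))
```

Both banks have the same mean waiting time, so the difference between them shows up only in the measures of variation. The range rule of thumb gives a rough estimate, not the exact value of s.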
►►►If the range of a certain distribution is 17, approximate the standard deviation.
►►►Do #24, page 90
►►►Identify usual and unusual values in
a) Days to Maturity data
b) Waiting times at the Jefferson Bank
c) Waiting times at the Providence Bank

Empirical Rule (or 68-95-99.7 rule)
If a distribution is approximately bell shaped, then
• About 68% of the scores fall within 1 standard deviation of the mean
• About 95% of the scores fall within 2 standard deviations of the mean
• About 99.7% of the scores fall within 3 standard deviations of the mean

►►►For the Days to Maturity data,
a) Find the values $\bar{x} - s$ and $\bar{x} + s$. How many of the 40 pieces of data have values within 1 standard deviation of the mean? What percentage of the sample is this?
b) Find the values $\bar{x} - 2s$ and $\bar{x} + 2s$. How many of the 40 pieces of data have values within 2 standard deviations of the mean? What percentage of the sample is this?
c) Find the values $\bar{x} - 3s$ and $\bar{x} + 3s$. How many of the 40 pieces of data have values within 3 standard deviations of the mean? What percentage of the sample is this?
d) Compare to the results predicted by the empirical rule. Does the result suggest an approximately normal distribution?

Chebyshev’s Theorem
• It applies to any set of data, but its results are very approximate.
The proportion of data lying within k standard deviations of the mean is at least $1 - \dfrac{1}{k^2}$, where k is any positive number greater than 1.
• At least 3/4 or 75% of all scores fall within 2 standard deviations of the mean: $1 - \dfrac{1}{2^2} = \dfrac{3}{4}$
• At least 8/9 or 89% of all scores fall within 3 standard deviations of the mean: $1 - \dfrac{1}{3^2} = \dfrac{8}{9}$
• At least ......... or .......... of all scores fall within 4 standard deviations of the mean
• For typical data sets, it is unusual for a score to differ from the mean by more than 2 or 3 standard deviations.

►►►For the Days to Maturity data, using Chebyshev’s theorem, we can say that
a) At least 89% of the data are within..........................
b) At least 75% are within....................................

►►►According to Chebyshev’s theorem, “At least 75% of the data fall within two standard deviations of the mean”, which is equivalent to stating that “At most 25 percent will be more than two standard deviations away from the mean”. Answer the following:
a) At most what percent of a distribution will be three or more standard deviations from the mean?
b) At most what percent of a distribution will be four or more standard deviations from the mean?
►►►Do #28, pg. 90

2.6 – Measures of Relative Standing

z-Scores (or Standard Scores)
• The z-score is the number of standard deviations that a given data value is above or below the mean. A score with a positive z-score is above the mean and a score with a negative z-score is below the mean.
• z-scores enable us to standardize values so that they can be compared.

►►►If you score 81 on your first exam, where $\bar{x}$ = 75 and s = 16, and you score 46 on your second exam, where $\bar{x}$ = 40 and s = 8, which score is relatively better?

First exam (number line):  $\bar{x}$ = 75, score = 81, $\bar{x} + s$ = 91
Second exam (number line): $\bar{x}$ = 40, score = 46, $\bar{x} + s$ = 48

From the diagram we can say that score = $\bar{x}$ + ? s. The number of standard deviations above or below the mean is called the z-score. It makes sense to use the formula $x = \bar{x} + z \cdot s$; solving for z gives
$z = \dfrac{x - \bar{x}}{s}$ (sample) or $z = \dfrac{x - \mu}{\sigma}$ (population)
Round z to 2 decimal places.
Note: We are using the standard deviation as a “yardstick”.
►►►Find the z-scores in the problem given above.
►►►Do problem #5, on page 100

• z-scores can be used to differentiate between ordinary values and unusual values.
• Values with z-scores within [-2, 2] are considered "ordinary" or "usual".
• Values with z-scores greater than 2, or less than -2, are considered "unusual".
(See Figure 2-14.)
►►►Do problem #6, on page 100

Quartiles and Percentiles
• Quartiles: Divide ranked data into 4 equal parts (Q1, Q2, Q3). (Similar to the median, which divides the data into 2.)
• Percentiles: Divide ranked data into 100 parts.
(P1, P2, ..., P99)
A score in the 88th percentile means that the student's score is higher than 88% of the scores.

Finding the percentile corresponding to a particular score
$\text{Percentile of } x = \dfrac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$
►►►From table 2-13 on page 95: Find the percentile corresponding to 245.

Reverse Process: Finding the score corresponding to a particular percentile
What score is at the kth percentile?
(1) Rank the data from lowest to highest.
(2) Find k% of the total number of values n. This is the locator L:
$L = \dfrac{k}{100} \cdot n$
a) If L is not a whole number, round up and find the score in that position.
b) If L is a whole number, find the average of the scores in positions L and L+1.
►►►Using the data from table 2-13, page 95, find each one of the following:
a) P40 =
b) Q1 =
c) Q2 = P50 =
d) Q3 = P75 =
e) P27 =

Interquartile Range
A statistic that we will use in the next section is defined in terms of quartiles. It is the interquartile range, or IQR. It measures the spread of the middle 50% of the data.
IQR = Q3 – Q1
►►►What is the IQR for the data in Table 2-13, page 95?

2.7 – Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate data sets in order to understand their important characteristics.

Outliers (extreme values)
Outliers are values that are very far away from almost all of the other values.
• An outlier can have a dramatic effect on the mean.
• An outlier can have a dramatic effect on the standard deviation.
• An outlier can have a dramatic effect on the scale of the histogram, so that the true nature of the distribution is totally obscured.
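The two percentile procedures of section 2.6 and the interquartile range can be sketched in Python. Table 2-13 is not reproduced in these notes, so this illustration assumes the Days to Maturity data instead. The function names are our own, and `score_at_percentile` assumes k is a whole number from 1 to 99.

```python
# Days to Maturity data from section 2.2, ranked from lowest to highest.
days = sorted([70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
               99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
               64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
               87, 78, 95, 80, 83, 65, 39, 86, 98, 70])

def percentile_of(value, data):
    """Percentile of a score: share of values strictly less than it, times 100."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

def score_at_percentile(k, ranked):
    """Score at the kth percentile using the locator L = (k / 100) * n."""
    n = len(ranked)
    whole, remainder = divmod(n * k, 100)  # L = whole + remainder / 100
    if remainder:
        # L is not whole: round up to position whole + 1 (list index `whole`).
        return ranked[whole]
    # L is whole: average the scores in positions L and L + 1.
    return (ranked[whole - 1] + ranked[whole]) / 2

q1 = score_at_percentile(25, days)
q2 = score_at_percentile(50, days)
q3 = score_at_percentile(75, days)
iqr = q3 - q1

print("Percentile of 70:", percentile_of(70, days))
print("P27 =", score_at_percentile(27, days))
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3, " IQR =", iqr)
```

Working with `divmod` on whole numbers avoids floating-point trouble when deciding whether L is a whole number. Calculators and statistical software may use slightly different quartile conventions, so small disagreements with this procedure are expected.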
(See the example on page 102 for the effect of outliers.)

Box Plots
Graphs that reveal central tendency, the spread of the data, the distribution of the data, and the presence of outliers (extreme scores).
• Do not show as much detailed information as histograms or stem-and-leaf plots
• Very useful for comparing two or more data sets (use the same scale)
• Used to identify the approximate shape of the distribution of a large data set
• For small data sets, box plots can be unreliable in identifying the distribution shape (using stem-and-leaf plots or dot plots is more appropriate in this case)

Steps for Constructing a Box Plot
(1) ARRANGE the data in ascending order.
(2) Find the 5-number summary:
Minimum value
First quartile: Q1, which is the median of the observations to the left of the overall median
Median: Q2
Third quartile: Q3, which is the median of the observations to the right of the overall median
Maximum value
(3) Use these numbers to construct the box plot.

►►►Construct the box plot for the data set. Here we have a sample of 15 incomes of college graduates, arranged in increasing order. The incomes are given in thousands.
4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110

►►►Construct the box plot for the data set. We are given below the earnings in 2001 of 16 randomly chosen people who have high school diplomas but no college. For convenience we have arranged the incomes in increasing order. The incomes are given in thousands.
5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67

Constructing a Box Plot with the TI-83
There are two choices of box plots within STAT PLOT. The 4th one is the modified box plot and shows outliers, if there are any.
Step 1: Clear L1.
Step 2: Enter the data in L1 (or L1 and L2 if grouped data).
Step 3: Define the statistical plot desired:
2nd Y= [STAT PLOT]
1: Plot1
ENTER to turn the plot on
Arrow down and right to the 4th graph icon, ENTER to select it.
Xlist: L1
Freq: 1 (or L2, if grouped data)
Step 4: Select a window by pressing ZOOM and selecting 9:ZoomStat

►►►The following table gives the number of public high school graduates, in thousands, for 15 states in 1980 and 1990. Use the calculator to construct the box plots for the data. Compare the two data sets.

State    1980     1990
FL       87.3     88.9
PA      146.5    110.5
MI      124.3     93.8
OH      144.2    114.5
AL       45.2     40.5
WY        6.1      5.8
NJ       94.6     69.8
ME       15.4     13.8
LA       46.3     36.1
IL      135.6    108.1
WI       69.3     52
NY      204.1    143.3
AZ       38.6     32.1
TN       49.8     46.1
SD       10.7      7.7

Box Plots and Distributions
See Figure 2-17 on page 105.
►►►Construct the box plot for the Days to Maturity data set.

Math 116 – Chapter 2 Highlights

Choosing an appropriate number to describe the data

Measuring the center of a distribution
• The mean cannot resist the influence of extreme observations. It is not a resistant measure of the center.
• The median is a resistant measure of the center (not affected by extreme observations).
• If the distribution is symmetric, the mean and median are the same.
• If the distribution is close to symmetric, the mean and median are very close in value.
• In a skewed distribution, the mean is farther out in the direction of skewness (the long tail) than the median is.
• Reports about incomes and other strongly skewed distributions usually give the median rather than the mean.

Example 1: Distributions of incomes are usually skewed to the ............... Which measure of the center is more appropriate? Why?
Example 2: The mean and median selling price of existing single-family homes sold in June 2002 were $163,900 and $210,900. Which of these numbers is the mean and which is the median? Explain how you know.
Example 3: A class of 9th grade students takes a test designed for 6th graders. The mean and median scores are 83% and 87%. What is the direction of the skewness of the test scores, and which number is the mean and which is the median?
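The claim that the mean is pulled toward the long tail while the median resists it can be illustrated with a short Python sketch. It assumes the sample of 15 college-graduate incomes from the box plot exercise in section 2.7; dropping the largest income is our own device for showing resistance and is not a step from the notes.

```python
import statistics

# Incomes (in thousands) of 15 college graduates, from section 2.7.
incomes = [4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110]

mean_all = statistics.mean(incomes)
median_all = statistics.median(incomes)

# Hypothetical comparison: the same sample without its largest value.
trimmed = incomes[:-1]
mean_trim = statistics.mean(trimmed)
median_trim = statistics.median(trimmed)

print(f"all 15 values:   mean={mean_all:.1f}, median={median_all}")
print(f"without the 110: mean={mean_trim:.1f}, median={median_trim}")
print(f"change in mean:   {mean_all - mean_trim:.1f}")
print(f"change in median: {median_all - median_trim:.1f}")
```

In this sample the mean lies above the median, and removing a single high income moves the mean by more than it moves the median.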
Measuring the spread of a distribution
• The minimum and maximum values show the full spread of the data (but they may be outliers, and the spread of the in-between numbers is ignored).
• The quartiles mark the spread of the middle half of the data as well as the spread of the upper and lower 25% of the data.

Box Plots – Five Number Summary
- In a symmetric distribution, the first and third quartiles are equally distant from the median.
- In most distributions that are skewed to the right, the third quartile will be farther above the median than the first quartile is below it.

The standard deviation measures the spread of the data by looking at how far the observations are from their mean and averaging those values.

Choosing measures of center and spread
• The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers.
• Use the mean and standard deviation only for reasonably symmetric distributions that are free of outliers.
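The five-number summary that underlies a box plot can be sketched in Python. This follows the "median of each half" description from the box plot steps in section 2.7 and assumes that, when n is odd, the overall median itself is left out of both halves. The function name `five_number_summary` is our own, and the data are the two income samples from that section.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ranked = sorted(data)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]
    # Assumption: for odd n the middle value belongs to neither half.
    upper = ranked[half + 1:] if n % 2 else ranked[half:]
    return (ranked[0],
            statistics.median(lower),
            statistics.median(ranked),
            statistics.median(upper),
            ranked[-1])

# Incomes in thousands, from the two box plot exercises.
college = [4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110]
high_school = [5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67]

for name, sample in [("college", college), ("high school", high_school)]:
    lo, q1, med, q3, hi = five_number_summary(sample)
    print(f"{name:12s} min={lo}, Q1={q1}, median={med}, Q3={q3}, max={hi}, "
          f"IQR={q3 - q1}")
```

For the college sample, Q3 is farther above the median than Q1 is below it, which is the pattern described above for distributions skewed to the right.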