Copyright © 2004 Pearson Education, Inc.

Chapter 2 Descriptive Statistics
Describe, Explore, and Compare Data

2-1 Overview
2-2 Frequency Distributions
2-3 Visualizing Data
2-4 Measures of Center
2-5 Measures of Variation
2-6 Measures of Relative Standing
2-7 Exploratory Data Analysis

Section 2-1 Overview
Created by Tom Wegleitner, Centreville, Virginia

Overview
Descriptive statistics describe the important characteristics of a set of data. They organize, present, and summarize data:
1. Graphically
2. Numerically

Important Characteristics of Quantitative Data: “Shape, Center, and Spread”
1. Center: a representative or average value that indicates where the middle of the data set is located.
2. Variation: a measure of the amount that the values vary among themselves.
3. Distribution: the nature or shape of the distribution of the data (such as bell-shaped, uniform, or skewed).

Sections 2-2 and 2-3 Frequency Distributions and Visualizing Data

Frequency Distributions and Histograms
Frequency distribution: a table that organizes data values into classes along with the number of data values that fall in each class (the frequency, f).
1. Ungrouped frequency distribution: for data sets with few different values. Each value is in its own class.
2. Grouped frequency distribution: for data sets with many different values, which are grouped together in classes.

Ungrouped Frequency Distributions
Number of Peas in a Pea Pod (sample size: 50)
5 5 4 6 4 3 7 6 3 5
6 5 4 5 5 6 2 3 5 5
5 5 7 4 3 4 5 4 5 6
5 1 6 2 6 6 6 6 6 4
4 5 4 5 3 5 5 7 6 5
Peas per pod    Freq, f
1               1
2               2
3               5
4               9
5               18
6               12
7               3

Frequency Histogram
A bar graph that represents the frequency distribution of a data set. It has the following properties:
1. The horizontal scale is quantitative and measures the data values.
2. The vertical scale measures the frequencies of the classes.
3. Consecutive bars must touch.

Frequency Histogram Ex.
[Figure: frequency histogram titled "Number of Peas in a Pod", drawn from the peas-per-pod table. Horizontal axis: Number of Peas (1 to 7). Vertical axis: Freq, f (0 to 20).]

Relative Frequency Distributions and Relative Frequency Histograms
Relative frequency distribution: shows the proportion (or percentage) of data values that fall into each class.
Relative frequency: $rf = f/n$
Relative frequency histogram: has the same shape and horizontal scale as a histogram, but the vertical scale is marked with relative frequencies (Figure 2-2).

Grouped Frequency Distributions
Group data into 5-20 classes of equal width.

Exam Scores    Freq, f
30-39          1
40-49          0
50-59          4
60-69          9
70-79          13
80-89          10
90-99          3

Definitions
Lower class limits (LCL): the smallest numbers that can actually belong to the different classes.
Upper class limits (UCL): the largest numbers that can actually belong to the different classes.
Class width: the difference between two consecutive lower class limits or two consecutive lower class boundaries.
Class midpoints: the values halfway between the LCL and the UCL of each class.
Class boundaries: the values halfway between a UCL and the next LCL.

Constructing a Grouped Frequency Table
1. Calculate the range of values to span the set: Range = High - Low. (May round up.)
2. Decide on the number of classes (should be between 5 and 20).
3. Calculate the class width (may round up):
   $\text{class width} \approx \frac{(\text{highest value}) - (\text{lowest value})}{\text{number of classes}}$
4. Choose the first LCL (less than or equal to the smallest value).
5. Write all the LCLs by repeatedly adding the class width.
6. Enter all the UCLs.
7. Find the frequency for each class.

“Shape” of Distribution
Symmetric: data are symmetric if the left half of the histogram is roughly a mirror image of its right half.
Skewed: data are skewed if they are not symmetric and extend more to one side than the other.
Uniform: data are uniform if they are equally distributed (on a histogram, all the bars are the same height).
See Figure 2-11.

Outliers
Outliers are “unusual” data values compared with the rest of the set. They may be distinguished by gaps in a histogram.

Other Graphs
Besides histograms, there are other ways to graph quantitative data:
1. Stem-and-leaf plots
2. Dot plots
3. Time-series graphs

Stem-and-Leaf Plot
Represents data by separating each value into two parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost digit).

Dot Plot
A graph in which each data value is plotted as a point along a scale of values (Figure 2-5).

Time-Series Graph
A graph of data that have been collected at different points in time (Figure 2-8).
Ex. www.eia.doe.gov/oil_gas/petroleum/

Qualitative Data
The two most common graphs for qualitative data are:
1. Pareto charts (bar charts)
2. Pie charts

Pareto Chart
A bar graph for qualitative data, with the bars arranged in order according to frequencies (Figure 2-6).
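Returning to the frequency tables of Sections 2-2 and 2-3: the sketch below tallies the pea-pod data into an ungrouped frequency distribution, converts it to relative frequencies (rf = f/n), and follows the seven construction steps for a grouped table. The function names and the `first_lcl` parameter are illustrative choices, not part of the slides, and the grouped routine assumes integer data.

```python
from collections import Counter
import math

# Pea-pod counts copied from the "Number of Peas in a Pea Pod" slide (n = 50).
PEAS = [
    5, 5, 4, 6, 4, 3, 7, 6, 3, 5,
    6, 5, 4, 5, 5, 6, 2, 3, 5, 5,
    5, 5, 7, 4, 3, 4, 5, 4, 5, 6,
    5, 1, 6, 2, 6, 6, 6, 6, 6, 4,
    4, 5, 4, 5, 3, 5, 5, 7, 6, 5,
]


def ungrouped_frequencies(values):
    """Return {value: frequency} sorted by value (one class per value)."""
    counts = Counter(values)
    return {value: counts[value] for value in sorted(counts)}


def relative_frequencies(freqs):
    """Convert frequencies f into relative frequencies rf = f / n."""
    n = sum(freqs.values())
    return {key: f / n for key, f in freqs.items()}


def grouped_frequencies(values, num_classes, first_lcl=None):
    """Build a grouped table for integer data.

    Returns a list of (lower_limit, upper_limit, frequency) tuples.
    Assumption: if first_lcl is not given, the smallest value is used.
    """
    low, high = min(values), max(values)
    if first_lcl is None:
        first_lcl = low
    # Step 3: class width = range / number of classes, rounded up.
    width = math.ceil((high - low) / num_classes)
    # Widen further if the classes would not reach the highest value.
    while first_lcl + width * num_classes <= high:
        width += 1
    table = []
    for i in range(num_classes):
        lcl = first_lcl + i * width   # step 5: add the class width
        ucl = lcl + width - 1         # step 6: upper limit for integers
        freq = sum(1 for v in values if lcl <= v <= ucl)  # step 7
        table.append((lcl, ucl, freq))
    return table


if __name__ == "__main__":
    freqs = ungrouped_frequencies(PEAS)
    for value, f in freqs.items():
        print(value, f, relative_frequencies(freqs)[value])
```

Run on the pea-pod data, the ungrouped tally reproduces the frequencies in the slide's table (1, 2, 5, 9, 18, 12, 3). The raw exam scores are not given in the slides, so the grouped routine can only be tried on data of your own.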
Pie Chart
A graph depicting qualitative data as slices of a pie (Figure 2-7).

Section 2-4 Measures of Center

Measures of Center
Measure of center: a number representing a “typical” or central value of a data set; an “average”. There are 4 common “averages”:
1. Mean
2. Median
3. Mode
4. Midrange

The Mean
Mean: the measure of center obtained by adding the values and dividing the total by the number of values.

Notation
Σ denotes the addition of a set of values.
x is the variable usually used to represent the individual data values.
n represents the number of values in a sample.
N represents the number of values in a population.
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values: $\bar{x} = \frac{\Sigma x}{n}$
µ is pronounced ‘mu’ and denotes the mean of all values in a population: $\mu = \frac{\Sigma x}{N}$

Round-off Rule for Measures of Center
Carry one more decimal place than is present in the original set of values.

Median
Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
It is often denoted by $\tilde{x}$ (pronounced ‘x-tilde’).
It is not affected by an extreme value.

Finding the Median
If the number of values is odd, the median is the number located in the exact middle of the sorted list.
If the number of values is even, the median is found by computing the mean of the two middle numbers.

Examples:
2 5 6 11 13 (odd number of values): the median is the exact middle value, so the median is 6.
2 5 6 9 11 13 (even number of values): the median is the mean of the two middle numbers, $(6 + 9)/2$, so the median is 7.5.
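The mean and median rules above can be sketched in a few lines of Python. The function names are illustrative; the two data sets are the ones used in the median example, and the outlier value 1300 is made up to show the effect of an extreme value.

```python
def mean(values):
    """Add the values and divide the total by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data.

    For an even number of values, return the mean of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [2, 5, 6, 11, 13]
even_sample = [2, 5, 6, 9, 11, 13]

print(median(odd_sample))          # 6
print(median(even_sample))         # 7.5
print(round(mean(odd_sample), 1))  # 7.4 (one more decimal place than the data)

# Hypothetical outlier: replacing 13 with 1300 moves the mean but not the median.
with_outlier = [2, 5, 6, 11, 1300]
print(median(with_outlier))        # 6
print(mean(with_outlier))          # 264.8
```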
Mode
Mode: the value that occurs most frequently. The mode is not always unique. A data set may be bimodal (two modes), multimodal (more than two modes), or have no mode (no value is repeated).
Examples:
a. 5.40 1.10 0.42 0.73 0.48 1.10    Mode is 1.10
b. 27 27 27 55 55 55 88 88 99       Bimodal: 27 and 55
c. 1 2 3 6 7 8 9 10                 No mode

Midrange
Midrange: the value midway between the highest and lowest values in the original data set.
$\text{Midrange} = \frac{\text{highest score} + \text{lowest score}}{2}$

Best Measure of Center
Picking the best “average”: the shape of your data may help determine the best measure of center (see Figure 2-11). Outliers may affect the mean, making it too high or too low to represent a “typical” value. If so, the median may be the best choice.

Section 2-5 Measures of Variation

Measures of Variation (“Spread”)
Because this section introduces the concept of variation, this is one of the most important sections in the entire book. The two most common methods of measuring spread:
1. Range
2. Standard deviation and variance

Definition
The range of a set of data is the difference between the highest value and the lowest value:
$\text{Range} = (\text{highest value}) - (\text{lowest value})$

Standard Deviation and Variance
These measure the amount by which data values vary (or deviate) from the mean.
Sample variance: $s^2 = \frac{\Sigma (x - \bar{x})^2}{n - 1}$
Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}}$

Round-off Rule for Measures of Variation
Carry one more decimal place than is present in the original set of data. Round only the final answer, not values in the middle of a calculation.
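As a rough sketch, the mode, midrange, range, and sample variance definitions above translate to Python as follows. The function names are illustrative, and the "no mode" convention assumed here is that no value occurs more than once.

```python
from collections import Counter
import math


def modes(values):
    """Return every value tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumption: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Value midway between the highest and lowest values."""
    return (max(values) + min(values)) / 2


def data_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


def sample_variance(values):
    """s^2 = sum((x - x_bar)^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """s = sqrt(s^2)."""
    return math.sqrt(sample_variance(values))


print(modes([5.40, 1.10, 0.42, 0.73, 0.48, 1.10]))    # [1.1]
print(modes([27, 27, 27, 55, 55, 55, 88, 88, 99]))    # [27, 55]
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))               # []
print(midrange([1, 2, 3, 6, 7, 8, 9, 10]))            # 5.5
print(data_range([1, 2, 3, 6, 7, 8, 9, 10]))          # 9
print(round(sample_std_dev([2, 5, 6, 11, 13]), 1))    # 4.5
```

Following the round-off rule, only the final standard deviation is rounded; the intermediate variance (20.3 for the last data set) is kept at full precision.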
Notation

                      Sample (statistics)    Population (parameters)
Mean                  $\bar{x}$              $\mu$
Standard deviation    $s$                    $\sigma$
Variance              $s^2$                  $\sigma^2$

Sample vs. Population Standard Deviation
Note: unlike $\bar{x}$ and $\mu$, the formulas for s and σ are not mathematically the same:
$s = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}}$
$\sigma = \sqrt{\frac{\Sigma (x - \mu)^2}{N}}$

Standard Deviation Key Points
$s \geq 0$. (When would s = 0?)
The standard deviation is a measure of variation of all values from the mean. The larger s is, the more the data vary.
The units of the standard deviation s are the same as the units of the original data values. (The variance has units squared.)
The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others).

Standard Deviation and “Spread”
How does s show how much the data vary? Three methods:
1. Range Rule of Thumb
2. Chebyshev’s Theorem
3. The Empirical Rule

The Range Rule of Thumb
Range rule: for most data sets, the majority of the data lie within 2 standard deviations of the mean.
Recall: Range = High - Low
Estimate: Range ≈ 4s
Alternatively, if the range is known, you can use the range rule to estimate the standard deviation:
$s \approx \frac{\text{Range}}{4}$

Chebyshev’s Theorem
For data with any distribution, the proportion (or fraction) of the data lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any number greater than 1.
For K = 2, at least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean.
For K = 3, at least 8/9 (or about 89%) of all values lie within 3 standard deviations of the mean.
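The difference between s and σ, the range rule of thumb, and Chebyshev's bound can be checked numerically. This is a minimal sketch: the function names are illustrative and the eight data values are made up for the example, not taken from the slides.

```python
import math


def sample_std_dev(values):
    """s: divide the sum of squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_std_dev(values):
    """sigma: divide the sum of squared deviations by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / big_n)


def range_rule_estimate(values):
    """Range rule of thumb: s is roughly Range / 4."""
    return (max(values) - min(values)) / 4


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k population std devs of the mean."""
    mu = sum(values) / len(values)
    sigma = population_std_dev(values)
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean 5

print(population_std_dev(data))        # 2.0
print(round(sample_std_dev(data), 2))  # 2.14
print(range_rule_estimate(data))       # 1.75
print(chebyshev_lower_bound(2))        # 0.75
print(proportion_within(data, 2))      # 1.0, which is at least 0.75
```

The range rule gives only a rough estimate (1.75 against an actual s of about 2.14 here), and Chebyshev's bound is a guaranteed minimum, so the observed proportion is usually well above it.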
The Empirical Rule
Empirical (68-95-99.7) Rule: for data sets having a symmetric, bell-shaped distribution:
About 68% of all values fall within 1 standard deviation of the mean.
About 95% of all values fall within 2 standard deviations of the mean.
About 99.7% of all values fall within 3 standard deviations of the mean.

Sections 2-6 and 2-7 Measures of Position (Relative Standing)

Measures of Position
Sometimes we want to know the “relative standing” or “relative position” of a particular data value in the set. Some measures of position:
1. Standard scores (z-scores)
2. Median, quartiles, percentiles

z-score
The z-score (or standard score) for a data value x is the number of standard deviations that x is above or below the mean.

Computing z-scores
To convert a data value x to a z-score:
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
Round z to 2 decimal places.

Interpreting z-scores (Figure 2-14)
Whenever a value is less than the mean, its corresponding z-score is negative.
Ordinary values: z-score between -2 and 2.
Unusual values: z-score < -2 or z-score > 2.

Other Measures of Position
Median, quartiles, percentiles.
Recall: the median separates ranked data into 2 equal parts.

Quartiles
Quartiles separate ranked data into 4 equal parts:
Q1 (first quartile) separates the bottom 25% of sorted values from the top 75%.
Q2 (second quartile) is the same as the median; it separates the bottom 50% of sorted values from the top 50%.
Q3 (third quartile) separates the bottom 75% of sorted values from the top 25%.

Q1, Q2, and Q3 divide ranked scores into four equal parts:
(Low) 25% | Q1 | 25% | Q2 (median) | 25% | Q3 | 25% (High)

Percentiles
Just as there are quartiles separating data into four parts, there are 99 percentiles, denoted P1, P2, . . . , P99, which partition the data into 100 groups.

Tukey’s 5-number Summary
Tukey’s 5-number summary: Low, Q1, Median, Q3, High.
These 5 numbers give another representation of “center and spread.”

Boxplots
A boxplot (or box-and-whisker plot) is a graphical representation of Tukey’s 5-number summary. Examples: Figure 2-16 and Figure 2-17.
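A short sketch of the measures of position from Sections 2-6 and 2-7: z-scores with the ordinary/unusual cutoff, and a 5-number summary. The function names and the numbers passed to `z_score` are made up for illustration. Quartile conventions differ between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd), which may not match the method used in the course text.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations x lies above or below the mean, to 2 decimals."""
    return round((x - mean) / std_dev, 2)


def is_unusual(z):
    """Unusual values have z < -2 or z > 2."""
    return z < -2 or z > 2


def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (low, Q1, median, Q3, high).

    Assumption: Q1 and Q3 are medians of the lower and upper halves,
    excluding the overall median when the number of values is odd.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]
    return (ordered[0], _median(lower), _median(ordered), _median(upper), ordered[-1])


# Hypothetical test scores with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))               # 1.5
print(is_unusual(z_score(85, 70, 10)))   # False
print(is_unusual(z_score(45, 70, 10)))   # True (z = -2.5)

print(five_number_summary([2, 5, 6, 9, 11, 13]))  # (2, 5, 7.5, 11, 13)
```

The five values returned are exactly what a boxplot draws: the whiskers end at Low and High, the box spans Q1 to Q3, and a line inside the box marks the median.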