IQL Chapter 4 – Describing Data
Statistical Reasoning for Everyday Life, Bennett, Briggs, Triola, 3rd Edition

4.1 – What is Average? Mean, Median and Mode

Definitions: Measures of Center in a Distribution
The mean is what we most commonly call the average value. It is found as follows:

mean = (sum of all values) / (total number of values)

The median is the middle value in the sorted data set (or halfway between the two middle values if the number of values is even).
The mode is the most common value (or group of values) in a data set.

Rounding Rule for Statistical Calculations
State your answers with one more decimal place of precision than is found in the raw data.
Example: The mean of 2, 3, and 5 is 3.3333 . . . , which we round to 3.3. Because the raw data are whole numbers, we round to the nearest tenth. As always, round only the final answer and not any intermediate values used in your calculations.

Effects of Outliers
Definition
An outlier in a data set is a value that is much higher or much lower than almost all others.

An outlier can change the mean significantly, because every value enters the sum. In general, the value of an outlier has no effect on the median, because outliers don't lie in the middle of a data set. Outliers do not affect the mode either. (However, the median may change if we delete an outlier, because we are changing the number of values in the data set.)

Confusion About "Average"
The word "average" can be confusing because it is often unclear whether it refers to the mean or the median. In addition, there is often too little information about how the average was computed, such as whether the data include outliers. Consider, for example, what the "average" hourly wage at a company would be if the president's annual earnings were included.

Weighted Mean
Definition
A weighted mean accounts for variations in the relative importance of data values. Each data value is assigned a weight, and the weighted mean is

weighted mean = (sum of (each data value x its weight)) / (sum of all weights)

Means with Summation Notation (Optional)
The symbol Σ (the Greek capital letter sigma) is called the summation sign and indicates that a set of numbers should be added.
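Before continuing with the notation, the measures of center defined above can be checked with a short Python sketch. This is only an illustration: the function names and the wage and grade figures are hypothetical and do not come from the textbook. The summation sign Σ corresponds to Python's built-in sum.

```python
from collections import Counter

def mean(values):
    """Sum of all values divided by the total number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data, or halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """Every value tied for the highest count (a data set can have several modes)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def weighted_mean(values, weights):
    """Sum of (value x weight) divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Rounding rule example from the notes: the mean of 2, 3, and 5.
print(round(mean([2, 3, 5]), 1))                  # 3.3

# Hypothetical hourly wages, then the same wages plus one outlier.
wages = [12, 13, 13, 14, 15]
with_outlier = wages + [500]
print(mean(wages), median(wages), modes(wages))
print(mean(with_outlier), median(with_outlier), modes(with_outlier))

# Hypothetical course grade: the first score counts three times as much.
print(weighted_mean([80, 90], [3, 1]))            # 82.5
```

Adding the single outlier moves the mean from 13.4 to 94.5, while the median only shifts from 13 to 13.5 (because the number of values changed) and the mode stays at 13.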
We use the symbol x to represent each value in a data set, so we write the sum of all the data values as

sum of all values = Σx

Means and Medians with Binned Data (Optional)
The ideas of this section can be extended to binned data simply by assuming that the middle value in the bin represents all the data values in the bin. For example, consider the following table of 50 binned data values:

Bin      Frequency
0-6      10
7-13     10
14-20    10
21-27    20

4.2 – Shapes of Distributions

Number of Modes
When describing the shape of a distribution, the mode refers to a peak in the visual display, and we describe the distribution by its number of peaks. This is similar to the quantitative mode, because peaks usually correspond to the highest counts in the data set, but here the term is used qualitatively.

Uniform distribution: no mode
Single-peaked (unimodal): 1 peak
Bimodal: 2 peaks
Trimodal: 3 peaks

SYMMETRY OR SKEWNESS
Definitions
A distribution is symmetric if its left half is a mirror image of its right half.
A distribution is left-skewed if its values are more spread out on the left side.
A distribution is right-skewed if its values are more spread out on the right side.

VARIATION
Definition
Variation describes how widely data are spread out about the center of a data set.

4.3 – Measures of Variation

WHY VARIATION MATTERS
In this section we look at quantitative measures of variation. The table below lists the waits at two banks.

        Big Bank (three lines)    Best Bank (one line)
        4.1                       6.6
        4.5                       6.7
        5.6                       6.7
        6.2                       6.9
        6.7                       7.1
        7.2                       7.2
        7.7                       7.3
        7.7                       7.4
        8.5                       7.7
        9.3                       7.8
        11                        7.8
AVG     7.14                      7.2

The average waits at the two banks are nearly identical (7.14 versus 7.2), so the difference in customer satisfaction comes from the variation in the waits, which is much greater at Big Bank.

RANGE
Definition
The range of a set of data values is the difference between its highest and lowest data values:

range = highest value (max) - lowest value (min)

Quartiles and Five-Number Summary
Quartiles are values that divide the data distribution into quarters.
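Before the quartile definitions, two earlier calculations can be sketched in Python: the mean of the binned table from Section 4.1 and the range of the two bank data sets. The function names are hypothetical, and the bank values are as read from the waiting-time table in these notes.

```python
def binned_mean(bins):
    """Mean of binned data, treating each bin's middle value as
    representative of every data value in that bin.
    bins: list of (low, high, frequency) tuples."""
    total = sum(freq for _, _, freq in bins)
    weighted_sum = sum((low + high) / 2 * freq for low, high, freq in bins)
    return weighted_sum / total

def data_range(values):
    """range = highest value (max) - lowest value (min)"""
    return max(values) - min(values)

# The 50 binned values from the table: middle values are 3, 10, 17, and 24.
bins = [(0, 6, 10), (7, 13, 10), (14, 20, 10), (21, 27, 20)]
print(binned_mean(bins))                    # 15.6

# Waits as read from the bank table (assumed to be transcribed correctly).
big_bank = [4.1, 4.5, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]
print(round(data_range(big_bank), 1))       # 6.9
print(round(data_range(best_bank), 1))      # 1.2
```

The two banks have nearly the same mean, but the range at Big Bank is several times larger, which is the variation the bank example is meant to highlight.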
Definitions
The lower quartile (or first quartile or Q1) divides the lowest fourth of a data set from the upper three-fourths. It is the median of the data values in the lower half of a data set. (Exclude the middle value in the data set if the number of data points is odd.)
The middle quartile (or second quartile or Q2) is the overall median.
The upper quartile (or third quartile or Q3) divides the lowest three-fourths of a data set from the upper fourth. It is the median of the data values in the upper half of a data set. (Exclude the middle value in the data set if the number of data points is odd.)

The Five-Number Summary
The five-number summary for a data distribution consists of the following five numbers:

low value
lower quartile
median
upper quartile
high value

Five-number summaries are typically displayed using a boxplot. The steps for drawing a boxplot are given below.

Drawing a Boxplot
Step 1. Draw a number line that spans all the values in the data set.
Step 2. Enclose the values from the lower to the upper quartile in a box. (The thickness of the box has no meaning.)
Step 3. Draw a line through the box at the median.
Step 4. Add "whiskers" extending to the low and high values.

PERCENTILES
Quartiles divide a data set into 4 segments. There are times when it is more useful to divide a data set into more segments. Quintiles divide a data set into 5 segments, and deciles divide a data set into 10 segments. When the data set is large, or when you wish to divide it into 100 segments, you use percentiles.

Definition
The nth percentile of a data set divides the bottom n% of data values from the top (100 - n)%. A data value that lies between two percentiles is often said to lie in the lower percentile.
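Stepping back to the five-number summary defined earlier in this section, the quartile rule (take the median of each half, excluding the middle value when the number of data points is odd) can be sketched in Python as follows. The function names are hypothetical, the bank values are as read from the waiting-time table in these notes, and the sketch assumes at least two data values.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (low, lower quartile, median, upper quartile, high)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]             # leaves out the middle value when n is odd
    upper_half = ordered[half + (n % 2):]   # skips the middle value when n is odd
    return (
        ordered[0],
        median_of_sorted(lower_half),
        median_of_sorted(ordered),
        median_of_sorted(upper_half),
        ordered[-1],
    )

# Waits as read from the bank table (assumed to be transcribed correctly).
big_bank = [4.1, 4.5, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]
print(five_number_summary(big_bank))    # (4.1, 5.6, 7.2, 8.5, 11.0)
print(five_number_summary(best_bank))   # (6.6, 6.7, 7.2, 7.7, 7.8)
```

Library quantile functions often interpolate between data values, so they may report slightly different quartiles than this rule does.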
You can approximate the percentile of any data value with the following formula:

percentile of data value = (number of values less than this data value) / (total number of values in data set) x 100

STANDARD DEVIATION
The standard deviation is a measure of how widely data values are spread around the mean of a data set. The calculation is set forth below.

Excel Function: STDEV
Estimates the standard deviation, assuming that the arguments represent only a sample of the total population. It takes the form =STDEV(number1,number2,…) and accepts up to 30 arguments in older versions of Excel (current versions accept up to 255).

Calculating the Standard Deviation
To calculate the standard deviation for any data set:
Step 1. Compute the mean of the data set. Then find the deviation from the mean for every data value by subtracting the mean from the data value. That is, for every data value,
        deviation from mean = data value - mean
Step 2. Find the squares (second power) of all the deviations from the mean.
Step 3. Add all the squares of the deviations from the mean.
Step 4. Divide this sum by the total number of data values minus 1.
Step 5. The standard deviation is the square root of this quotient.

Overall, these steps produce the standard deviation formula:

standard deviation = square root of [ (sum of (deviations from the mean)^2) / (total number of data values - 1) ]

(The same formula appears in summation notation in the optional section later in these notes.)

INTERPRETING THE STANDARD DEVIATION
The range rule of thumb is an approximation that helps in interpreting the standard deviation.

The Range Rule of Thumb
The standard deviation is approximately related to the range of a distribution by the range rule of thumb:

standard deviation ≈ range / 4

If we know the range of a distribution (range = high - low), we can use this rule to estimate the standard deviation. Alternatively, if we know the standard deviation, we can use this rule to estimate the low and high values as follows:

low value ≈ mean - (2 x standard deviation)
high value ≈ mean + (2 x standard deviation)

The range rule of thumb does not work well when the high or low values are outliers.
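The five calculation steps and the range rule of thumb can be sketched in Python as follows. The function names are hypothetical, and the bank values are as read from the waiting-time table earlier in these notes. The result of the step-by-step calculation is compared with statistics.stdev from the Python standard library, which also computes the sample standard deviation.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation, following Steps 1 to 5."""
    n = len(values)
    mean = sum(values) / n                      # Step 1: mean ...
    deviations = [x - mean for x in values]     # ... and deviations from the mean
    squares = [d ** 2 for d in deviations]      # Step 2: square each deviation
    total = sum(squares)                        # Step 3: add the squares
    quotient = total / (n - 1)                  # Step 4: divide by (n - 1)
    return math.sqrt(quotient)                  # Step 5: take the square root

def range_rule_estimate(values):
    """Range rule of thumb: standard deviation is roughly range / 4."""
    return (max(values) - min(values)) / 4

# Waits as read from the bank table (assumed to be transcribed correctly).
big_bank = [4.1, 4.5, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, data in [("Big Bank", big_bank), ("Best Bank", best_bank)]:
    print(
        name,
        round(sample_std(data), 2),             # step-by-step result
        round(statistics.stdev(data), 2),       # library result for comparison
        round(range_rule_estimate(data), 2),    # rough estimate from the range
    )
```

For these data the standard deviation is about 2.04 at Big Bank and about 0.44 at Best Bank, while the range rule of thumb gives roughly 1.7 and 0.3. The rule is only an approximation, so estimates of this kind should be read as rough guides.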
STANDARD DEVIATION WITH SUMMATION NOTATION (OPTIONAL SECTION)
The summation notation introduced earlier makes it easy to write the standard deviation formula in a compact form:

s = square root of [ Σ(x - mean)^2 / (n - 1) ]

Here x represents each data value and n is the total number of data values. The symbol s is the conventional symbol for the standard deviation of a sample. For the standard deviation of a population, statisticians use the Greek letter σ (sigma), and the term n - 1 in the formula is replaced by n. Consequently, you will get slightly different results for the standard deviation depending on whether you assume the data represent a sample or a population.

4.4 – STATISTICAL PARADOXES