Section 3.1 – Measures of Central Tendency

A measure of center is a value at the center (or middle) of a data set.

Notation:
x = variable representing data values
n = total number of values in the sample
∑ = sum or “add up”
N = total number of values in the population (difficult to know)

Arithmetic Mean – Add the data values and divide by the total number of values.

$\bar{x}$ = sample mean = $\frac{\sum x}{n}$ (statistic)
$\mu$ = population mean = $\frac{\sum x}{N}$ (parameter)

Median – Data point that lies in the middle of the data set when arranged from smallest to largest.

If you have an ODD number of values, the median will be the middle value, in position $\frac{n+1}{2}$.
If you have an EVEN number of values, take the average of the middle two values, in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.

When data are skewed left or right, that means there are extreme values in the tail which are pulling the mean in the direction of the tail. (see Table 4 on pg. 134)

Distribution Shape     Mean vs. Median
Skewed Left            Mean < Median
Symmetric              Mean = Median
Skewed Right           Mean > Median

Mode – The value that occurs most often. The data set can be bimodal, multimodal, or have no mode. (you can find the mode of qualitative data, too!)

Example: Find the mean, median, and mode for the following data set. (flight times in minutes of six flights from Las Vegas to Newark on Continental Airlines)

282   270   260   266   257   260

Finding Mean and Median on the calculator:
1. Enter the data into a list (L1)
2. Select STAT
3. Arrow over to the menu labeled “CALC”
4. Highlight “1-VAR STATS” and push enter.
5. Type the name of the list (L1 or L2, etc.)
6. Push enter and scroll down.
Note: The default list for 1-VarStats is L1

A numerical summary of data is resistant if extreme values (very large or small) do not affect its value substantially.

Which of the following is considered resistant: Mean or Median?

Section 3.2 – Measures of Dispersion

Range = highest value – lowest value

Variance is based on how the data points deviate from the mean.
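To make “deviate from the mean” concrete, here is a minimal Python sketch using the six flight times from these notes. It computes the range, each data point's deviation from the mean, and the sum of squared deviations divided by n – 1 (the sample variance defined in this section). The variable names are illustrative choices, not part of the notes. The deviations themselves always add up to zero (apart from floating-point rounding), which is why they are squared before being combined.

```python
from statistics import mean, variance

# Flight times in minutes (Las Vegas to Newark), taken from the example in these notes.
times = [282, 270, 260, 266, 257, 260]

# Range = highest value - lowest value
data_range = max(times) - min(times)

# Deviation of each data point from the sample mean
x_bar = mean(times)
deviations = [x - x_bar for x in times]

# Square each deviation, add them up, and divide by n - 1
squared_deviations = [d ** 2 for d in deviations]
sample_variance = sum(squared_deviations) / (len(times) - 1)

print("range:", data_range)
print("deviations:", [round(d, 2) for d in deviations])
print("sample variance:", round(sample_variance, 2))

# Cross-check against the standard library's sample variance (also divides by n - 1)
print("agrees with statistics.variance:",
      abs(sample_variance - variance(times)) < 1e-9)
```

The by-hand calculation and the library function should agree; the calculator's 1-VarStats output reports the standard deviation, so square it before comparing.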
Variance of a sample = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ or $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$

Notation:
n = the number of data values in the sample
$x_i$ = the individual data values
$\sum x_i^2$ means “square each data point, then add them all up”
$(\sum x_i)^2$ means “add up all of the data points, then square the result”

Standard Deviation of a sample = $s = \sqrt{s^2}$

Sample Variance: $s^2$
Population Variance: $\sigma^2$
Sample Standard Deviation: $s$
Population Standard Deviation: $\sigma$

Properties and Interpretations of Standard Deviation:
1. The standard deviation is a measure of variation of all values from the mean. We say the standard deviation is the “typical deviation from the mean”. The larger the standard deviation, the more dispersion the distribution has.
2. The value of the standard deviation is either zero or positive. (Zero means that all of the data points are the same number)
3. Outliers cause the standard deviation to increase dramatically.
4. The units of the standard deviation are the same as the units of the original data points.

Example: Find the range, variance, and standard deviation of the following data set: (flight times in minutes of six flights from Las Vegas to Newark on Continental Airlines)

282   270   260   266   257   260

Example: See problem #24 on pg. 154. Use your calculator (1-VarStats) to find the standard deviation for each group. Which color has more variation in the responses?

Participant    Reaction Time (BLUE)    Reaction Time (RED)
1              0.582                   0.408
2              0.481                   0.407
3              0.841                   0.542
4              0.267                   0.402
5              0.685                   0.456
6              0.450                   0.533

Empirical Rule for Bell-shaped data: (see the picture on pg. 151)
About 68% of all data values are within 1 standard deviation of the mean (i.e., between $\mu - \sigma$ and $\mu + \sigma$)
About 95% of all data values are within 2 standard deviations of the mean (i.e., between $\mu - 2\sigma$ and $\mu + 2\sigma$)
About 99.7% of all data values are within 3 standard deviations of the mean (i.e., between $\mu - 3\sigma$ and $\mu + 3\sigma$)

Example: IQ scores of normal adults have a bell-shaped distribution with a mean of 100 and a standard deviation of 15.
a) What percentage of adults have IQ scores between 55 and 145?
b) What percentage of adults have IQ scores greater than 145?

Chebyshev’s Inequality: (works for any data set, not just bell-shaped ones)
At least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the data values will lie within k standard deviations of the mean. (k > 1)

Example: According to the US Census Bureau, the mean commute time to work for a resident of Boston, MA is 27.3 minutes. Assume that the standard deviation of commute time is 8.1 minutes.
a) What percentage of commuters in Boston has a commute time within 2 standard deviations of the mean?
b) What percentage of commuters have commute times between 3 and 51.6 minutes?

Section 3.4 – Measures of Position and Outliers

Z-score – The distance that a data value is from the mean in terms of the number of standard deviations.
Negative z-scores indicate that the data point is below the mean.
Positive z-scores indicate that the data point is above the mean.

$z = \frac{x - \bar{x}}{s}$ (sample)
$z = \frac{x - \mu}{\sigma}$ (population)

ALWAYS ROUND Z-SCORES TO 2 DECIMAL PLACES!!

Example: Three students take equivalent stress tests. Which of the following scores is the highest relative score?
Student 1: A score of 144 on a test with a mean of 128 and a standard deviation of 34.
Student 2: A score of 90 on a test with a mean of 86 and a standard deviation of 18.
Student 3: A score of 18 on a test with a mean of 15 and a standard deviation of 5.

Example: For men aged 18–24 years, the serum cholesterol levels have a mean of 178.1 and a standard deviation of 40.7 (measured in mg/100mL). Find the z-score corresponding to a male, aged 18–24 years, who has a serum cholesterol level of 359.

Quartiles and Percentiles

Recall that the median separates the sorted data into 2 equal parts, the lower 50% and the upper 50%.

The kth percentile, $P_k$, of a set of data is a value (data point) such that k percent of the data points are less than or equal to that value.
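The percentile definition above can be read as a counting rule: pick a value, count how many data points are less than or equal to it, and express that count as a percent of n. The sketch below does exactly that for the six flight times used earlier in these notes. The function name `percent_at_or_below` is a hypothetical helper introduced for illustration only, and with so few data points the percents jump in large steps.

```python
def percent_at_or_below(values, cutoff):
    """Percent of data points that are less than or equal to cutoff."""
    count = sum(1 for v in values if v <= cutoff)
    return 100 * count / len(values)

# Flight times in minutes from the earlier example, sorted low to high.
times = sorted([282, 270, 260, 266, 257, 260])

for t in times:
    print(t, "->", round(percent_at_or_below(times, t), 1), "% at or below")
```

Note that the data are sorted first, matching the rule in these notes that percentiles are always read from data arranged low to high.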
Percentiles – separate the sorted data into 100 equal parts with 1% of the data values in each group.
P13 separates the lower 13% from the upper 87%
P55 separates the lower 55% from the upper 45%
etc…

Quartiles – separate the sorted data into 4 equal parts with 25% of the data values in each group.
Q1 separates the lower 25% from the upper 75%
Q2 separates the lower 50% from the upper 50% (also called the median)
Q3 separates the lower 75% from the upper 25%

BEFORE CALCULATING PERCENTILES, THE DATA MUST BE SORTED LOW TO HIGH!!!

Example: The following 40 data points are the sorted final exam grades of students who took the statistics final exam in 2005. Calculate Q1, Q2, and Q3. (The sorted order reads down each column.)

46   65   74   84
51   66   75   84
53   67   75   84
54   70   75   84
57   70   76   87
58   71   77   90
58   71   77   90
60   72   83   92
62   74   83   93
64   74   83   95

Quartiles can also be found on your calculator using 1-VarStats (scroll down)

The Interquartile Range (IQR) – the range of the middle 50% of the values in a data set.
IQR = Q3 – Q1
Interpretation: The more spread out a data set is, the higher the IQR will be.

Example: Calculate the IQR for the last example (test grades).

Outlier – An extreme observation (high or low)

How to tell if a data point is an outlier or not:
1. Calculate Q1 and Q3.
2. Calculate the IQR.
3. Determine the “fences”.
   Lower fence = Q1 – 1.5(IQR)
   Upper fence = Q3 + 1.5(IQR)
4. If a data point is less than the lower fence or higher than the upper fence, then it is an OUTLIER!

Example: The following data represent the hemoglobin levels for 20 randomly selected cats. (The sorted order reads down each column.)

5.7    10.0
7.7    10.3
7.8    10.6
8.7    10.7
8.9    11.0
9.4    11.2
9.5    11.7
9.6    12.9
9.6    13.0
9.9    13.4

a) Calculate the mean and standard deviation of the hemoglobin level for this data set.
b) A cat named Daisy had a hemoglobin level of 7.8. Calculate her z-score and interpret.
c) Calculate the IQR and the fences.
d) Are there any outliers?

Which statistics should I use for my data set?
If the distribution is SYMMETRIC, the best measure of central tendency is the _________________, and the best measure of dispersion is the ______________________.

If the distribution is SKEWED, the best measure of central tendency is the _________________, and the best measure of dispersion is the ______________________.

Section 3.5 – The Five-Number Summary and Boxplots

5-Number Summary:
1. Minimum value
2. Q1
3. Q2 (median)
4. Q3
5. Maximum value

How to Draw a Boxplot:
1. Draw a box with vertical lines at Q1, Q2, and Q3.
2. Draw brackets at the lower and upper fences.
3. Draw a horizontal line from Q1 to the smallest data point inside the lower fence.
4. Draw a horizontal line from Q3 to the largest data point inside the upper fence.
5. Label any outliers with asterisks (*).

Example: Create a Box & Whisker Plot for the “birth month” data for both classes. Compare.

Example: See #6 on pg. 181
a) What is the median of variable x?
b) What is the first quartile of variable y?
c) Which variable has more dispersion? Why?
d) Does the variable x have any outliers? If so, list them.
e) Describe the shape of variable x.
f) Describe the shape of variable y.
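As a way to check boxplot work by computer, the following Python sketch puts the five-number summary together with the fence rule from Section 3.4, using the cat hemoglobin data from these notes. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data; other software may use a different interpolation rule and report slightly different quartiles. The function names `five_number_summary` and `fences` are hypothetical helpers introduced here for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), assuming the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # lower half of the sorted data
    upper = data[half + (n % 2):]    # upper half (middle value skipped when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

def fences(q1, q3):
    """Lower fence = Q1 - 1.5(IQR), upper fence = Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hemoglobin levels for the 20 cats in the Section 3.4 example.
hemoglobin = [5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9,
              10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4]

low, q1, med, q3, high = five_number_summary(hemoglobin)
lower_fence, upper_fence = fences(q1, q3)

# Any point outside the fences is flagged as an outlier.
outliers = [x for x in hemoglobin if x < lower_fence or x > upper_fence]

print("five-number summary:", low, q1, med, q3, high)
print("fences:", round(lower_fence, 3), round(upper_fence, 3))
print("outliers:", outliers)
```

In a boxplot drawn from this output, the whiskers would stop at the smallest and largest data points that fall inside the fences, and anything in `outliers` would be marked with an asterisk.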