Review (BPS chapter 1): Picturing Distributions with Graphs
• What is statistics?
• Individuals and variables
• Two types of data: categorical and quantitative
• Ways to chart categorical data: bar graphs and pie charts
• Ways to chart quantitative data: histograms and stem plots
• Interpreting histograms
• Time plots

Example (BPS chapter 1)
Indicate whether each of the following variables is categorical or quantitative.
a. We have data on 20 individuals measuring amount of time it takes to climb five flights of stairs. Quantitative
b. During a clinical trial, an experimental pain relief drug is administered to individuals. Each individual is then asked whether s/he experienced any pain relief. Categorical

Objectives (BPS chapter 2): Describing distributions with numbers
• Measure of center: mean and median
• Measure of spread: quartiles and standard deviation
• The five-number summary and boxplots
• IQR and outliers
• Choosing among summary statistics

Measure of center: the mean
The mean or arithmetic average
To calculate the average, or mean, add all values, then divide by the number of individuals. It is the “center of mass.”

Heights of 25 women (inches):

woman (i)   height (x)      woman (i)   height (x)
i = 1       x1 = 58.2       i = 14      x14 = 64.0
i = 2       x2 = 59.5       i = 15      x15 = 64.5
i = 3       x3 = 60.7       i = 16      x16 = 64.1
i = 4       x4 = 60.9       i = 17      x17 = 64.8
i = 5       x5 = 61.9       i = 18      x18 = 65.2
i = 6       x6 = 61.9       i = 19      x19 = 65.7
i = 7       x7 = 62.2       i = 20      x20 = 66.2
i = 8       x8 = 62.2       i = 21      x21 = 66.7
i = 9       x9 = 62.4       i = 22      x22 = 67.1
i = 10      x10 = 62.9      i = 23      x23 = 67.8
i = 11      x11 = 63.9      i = 24      x24 = 68.9
i = 12      x12 = 63.1      i = 25      x25 = 69.6
i = 13      x13 = 63.9      n = 25      Sum = 1598.3

Sum of heights is 1598.3. Divided by 25 women = 63.9 inches.

Mathematical notation:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\bar{x} = \frac{1598.3}{25} = 63.9$$

Learn right away how to get the mean using your calculators.
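For readers who would rather check the arithmetic in code than on a calculator, here is a minimal Python sketch of the mean calculation for the 25 heights. The variable and function names are illustrative choices, not part of the slides.

```python
import math

# Heights (inches) of the 25 women from the slide.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]


def mean(values):
    """Arithmetic mean: add all values, then divide by how many there are."""
    # math.fsum limits floating-point rounding error when adding many decimals.
    return math.fsum(values) / len(values)


n = len(heights)
total = math.fsum(heights)
x_bar = mean(heights)
print(f"n = {n}, sum = {total:.1f}, mean = {x_bar:.1f}")
# prints: n = 25, sum = 1598.3, mean = 63.9
```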
Measure of center: the median
The median (M) is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

1. Sort observations from smallest to largest.
2. Find the location of the median (L), where n = number of observations.
   (1) If n is odd, the median is observation (n+1)/2 down the list.
       n = 25, L = (n+1)/2 = 26/2 = 13, M = 3.4
   (2) If n is even, the median is the mean of the two center observations.
       n = 24, L = (n+1)/2 = 12.5, M = (3.3 + 3.4)/2 = 3.35

Sorted data, n = 25 (the 13th observation is the median, 3.4):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Sorted data, n = 24 (the 12th and 13th observations, 3.3 and 3.4, are the two center observations):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6

Comparing the mean and the median
The mean and the median are the same only if the distribution is symmetrical. In a skewed distribution, the mean is usually farther out in the long tail than is the median. The median is a measure of center that is resistant to skew and outliers. The mean is not.

[Figure: mean and median for a symmetric distribution, and for left-skewed and right-skewed distributions]

Mean and median of a distribution with outliers
[Figure: percent of people dying, without the outliers ($\bar{x} = 3.4$) and with the outliers ($\bar{x} = 4.2$)]
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

Impact of skewed data
Mean and median of a symmetric distribution
Disease X: $\bar{x} = 3.4$, M = 3.4. Mean and median are the same.
and a right-skewed distribution
Multiple myeloma: $\bar{x} = 3.4$, M = 2.5. The mean is pulled toward the skew.
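The two-step median rule (sort, then locate position (n+1)/2) can be sketched in Python as follows. This is an illustrative sketch rather than anything from the slides; the function name and the variable names are assumptions, and the data are the 25 observations used in the median example.

```python
def median_by_location(values):
    """Median via the slide's rule: sort, then look at position L = (n + 1) / 2."""
    ordered = sorted(values)          # step 1: sort smallest to largest
    n = len(ordered)
    middle = n // 2                   # 0-based index of the upper center value
    if n % 2 == 1:
        # n odd: L = (n + 1) / 2 is a whole number; 1-based L is index `middle`.
        return ordered[middle]
    # n even: L ends in .5, so average the two center observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


observations = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

m_odd = median_by_location(observations)        # n = 25, 13th value
m_even = median_by_location(observations[:-1])  # n = 24, drop the 6.1
print(f"n = 25: M = {m_odd:.2f}")
print(f"n = 24: M = {m_even:.2f}")
```

With n = 25 the function returns the 13th sorted value, and with n = 24 it averages the 12th and 13th, matching the two cases in the rule.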
Example: STAT 200 Midterm Score
Midterm: 30 35 40 40 40 40 45 45 45 45 50 50 55 55 60 65 65 70 100 100

Descriptive Statistics: Midterm
Variable   N    Mean    StDev   Minimum   Q1      Median   Q3      Maximum
Midterm    20   53.75   18.98   30.00     40.00   47.50    63.75   100.00

Measure of spread: quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it.
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it.

Sorted data, n = 25:
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 = first quartile = 2.2 (the median of the 12 observations below M)
M = median = 3.4
Q3 = third quartile = 4.35 (the median of the 12 observations above M)

Center and spread in boxplots
[Figure: boxplot of years until death for Disease X, with the five values below marked]
“Five-number summary”
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6

Boxplots for skewed data
Comparing box plots for a normal and a right-skewed distribution
[Figure: boxplots of years until death for Disease X and multiple myeloma]
Boxplots remain true to the data and clearly depict symmetry or skewness.

IQR and outliers
The interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot).
IQR = Q3 − Q1
An outlier is an individual value that falls outside the overall pattern.
• How far outside the overall pattern does a value have to fall to be considered an outlier?
• The 1.5 × IQR rule for outliers
  Low outlier: any value < Q1 − 1.5 × IQR
  High outlier: any value > Q3 + 1.5 × IQR

Example: STAT 200 Midterm Score
IQR = Q3 − Q1 = 63.75 − 40.00 = 23.75
Low outlier: any value < Q1 − 1.5 × IQR = 40.00 − 1.5(23.75) = 4.375
High outlier: any value > Q3 + 1.5 × IQR = 63.75 + 1.5(23.75) = 99.375
Midterm: 30 35 40 40 40 40 45 45 45 45 50 50 55 55 60 65 65 70 100 100
Outliers: the two scores of 100, which exceed 99.375.

Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean.
1) First calculate the variance s².
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2) Then take the square root to get the standard deviation s.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
[Figure: distribution with the mean ± 1 s.d. marked]

Calculations …
Women’s height (inches)

i     x_i   x̄      (x_i − x̄)   (x_i − x̄)²
1     59    63.4   −4.4        19.0
2     60    63.4   −3.4        11.3
3     61    63.4   −2.4        5.6
4     62    63.4   −1.4        1.8
5     62    63.4   −1.4        1.8
6     63    63.4   −0.4        0.1
7     63    63.4   −0.4        0.1
8     63    63.4   −0.4        0.1
9     64    63.4   0.6         0.4
10    64    63.4   0.6         0.4
11    65    63.4   1.6         2.7
12    66    63.4   2.6         7.0
13    67    63.4   3.6         13.3
14    68    63.4   4.6         21.6
Sum                0.0         85.2

Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches

We’ll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator.

Choosing among summary statistics
• Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. Plot the mean and use the standard deviation for error bars.
• Otherwise, use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: height of 30 women (height in inches), shown as a boxplot and as mean ± s.d.]

Example 1
Suppose a sample of twelve lab rats is found to have the following glucose levels:
3 4 4 6 6 6 8 8 9 10 12 15
1. Find the five-number summary of the data and construct a boxplot.
Min = 3, Q1 = 5, M = 7, Q3 = 9.5, Max = 15
2. Based on the boxplot, the data set is
a. skewed to left
b. roughly symmetric
c. skewed to right

Example 2
Suppose a researcher is recording fifty values in a database. Suppose she records every value correctly except the lowest value, which is supposed to be “2” but which she incorrectly types as “200”.
In the above scenario, the effect of the researcher’s error on mean and median is:
a. Her calculated mean will be lower than it would have been without the error, but her calculated median will remain unchanged.
b. Her calculated mean will be higher than it would have been without the error, but her calculated median will remain unchanged.
c. Her calculated mean will remain unchanged, but her calculated median will be lower than it would have been without the error.
d. Her calculated mean will remain unchanged, but her calculated median will be lower than it would have been without the error.

Example 2 (continued)
In the above scenario, the effect of the researcher’s error on standard deviation is:
a. The error will not affect standard deviation.
b. Her calculated standard deviation will be smaller than it would have been without the error.
c. Her calculated standard deviation will be larger than it would have been without the error.
d. The error is likely to make the calculated standard deviation negative.

Example 3
There are three children in a room, ages 3, 4, and 5. If a four-year-old child enters the room, the
a. mean age and variance will stay the same.
b. mean age and variance will increase.
c. mean age will stay the same but the variance will increase.
d. mean age will stay the same but the variance will decrease.
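As a cross-check on the worked calculations from earlier in the chapter (the five-number summary for the rat glucose data, the 1.5 × IQR fences for the midterm scores, and the standard deviation of the 14 heights), here is a rough Python sketch. The function names are illustrative. The quartile rule shown, taking Q1 and Q3 as the medians of the lower and upper halves, is an assumption that reproduces the Example 1 answer; statistical software may use a different quartile convention, so the midterm fences below use the Q1 and Q3 values reported in the descriptive statistics output rather than recomputing them.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max).

    Assumed quartile rule: Q1 and Q3 are the medians of the observations
    below and above the overall median's position (the middle observation
    itself is left out of both halves when n is odd).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


def outlier_fences(q1, q3):
    """1.5 x IQR rule: values below the low fence or above the high fence are outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Example 1: glucose levels of twelve lab rats.
glucose = [3, 4, 4, 6, 6, 6, 8, 8, 9, 10, 12, 15]
summary = five_number_summary(glucose)
print(summary)  # (3, 5.0, 7.0, 9.5, 15)

# STAT 200 midterm scores, with Q1 and Q3 taken from the reported output.
midterm = [30, 35, 40, 40, 40, 40, 45, 45, 45, 45,
           50, 50, 55, 55, 60, 65, 65, 70, 100, 100]
low, high = outlier_fences(40.00, 63.75)
outliers = [score for score in midterm if score < low or score > high]
print(low, high, outliers)  # 4.375 99.375 [100, 100]

# Standard deviation of the 14 heights: divide by n - 1, then take the root.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]
s = statistics.stdev(heights)  # sample standard deviation (n - 1 denominator)
print(f"s = {s:.2f} inches")  # s = 2.56 inches
```

`statistics.stdev` divides by n − 1, which matches the degrees of freedom used in the hand calculation; `statistics.pstdev` would divide by n instead and give a slightly smaller value.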