Lecture 2 – Aug 29
© 2012 W.H. Freeman and Company

Five-number summary and boxplot

[Figure: boxplot of years until death for 25 individuals with Disease X, drawn beside the ranked data. Vertical axis: years until death.]

Data (years until death, n = 25, largest to smallest):
6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.9 1.6 1.5 1.2 0.6

Five-number summary: min Q1 M Q3 max
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6

Boxplots for skewed data

Comparing box plots for a normal and a right-skewed distribution.

[Figure: side-by-side boxplots for Disease X and Multiple Myeloma. Vertical axis: years until death, 0 to 15.]

Boxplots remain true to the data and clearly depict symmetry or skew.

[Figure: boxplot of years until death for Disease X with the largest value changed to 7.9. Vertical axis: years until death.]

Data (years until death, n = 25, largest to smallest):
7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.9 1.6 1.5 1.2 0.6

Q3 = 4.35
Q1 = 2.2
Interquartile range: Q3 − Q1 = 4.35 − 2.2 = 2.15
Distance to Q3: 7.9 − 4.35 = 3.55

Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years. Thus, individual #25 is a suspected outlier.

Measure of spread: the standard deviation

The standard deviation "s" is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers.

1. First calculate the variance $s^2$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

2. Then take the square root to get the standard deviation $s$:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

[Figure: distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked.]

Density curves

A density curve is a mathematical model of a distribution. The total area under the curve, by definition, is equal to 1, or 100%. The area under the curve for a range of values is the proportion of all observations for that range.

[Figure: histogram of a sample with the smoothed density curve that theoretically describes the population.]
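The summary measures introduced earlier in this lecture (five-number summary, the 1.5 × IQR rule, and the standard deviation) can be checked with a short Python sketch. It assumes the quartile convention that reproduces the lecture's values: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, with the overall median left out when n is odd. The function and variable names are illustrative and do not come from the lecture.

```python
import math
import statistics

# Years until death for the 25 Disease X individuals, as listed in the lecture.
disease_x = [6.1, 5.6, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
             3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.9, 1.6, 1.5, 1.2, 0.6]

# Second version of the data, in which the largest value is 7.9 years.
disease_x_with_7_9 = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7,
                      3.6, 3.4, 3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.9, 1.6,
                      1.5, 1.2, 0.6]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data, excluding the overall median when n is odd.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])


def suspected_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - fence or x > q3 + fence]


def sample_sd(values):
    """Standard deviation s: square root of the variance with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


if __name__ == "__main__":
    print([round(v, 3) for v in five_number_summary(disease_x)])
    print(suspected_outliers(disease_x_with_7_9))
    print(round(sample_sd(disease_x), 3))
```

Library routines such as `statistics.quantiles` use interpolation conventions that can give slightly different quartiles, which is why the halves-based rule is written out by hand here.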
Median and mean of a density curve

The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.

The mean of a density curve is the balance point, at which the curve would balance if it were made of solid material.

The median and mean are the same for a symmetric density curve. The mean of a skewed curve is pulled in the direction of the long tail.

Normal distributions

Normal – or Gaussian – distributions are a family of symmetrical, bell-shaped density curves defined by a mean $\mu$ (mu) and a standard deviation $\sigma$ (sigma): $N(\mu, \sigma)$.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

e = 2.71828… The base of the natural logarithm
π = pi = 3.14159…

A family of density curves

Here, means are the same ($\mu$ = 15) while standard deviations are different ($\sigma$ = 2, 4, and 6).

Here, means are different ($\mu$ = 10, 15, and 20) while standard deviations are the same ($\sigma$ = 3).

[Figure: two panels of normal density curves for the parameter values above. Horizontal axis: x, 0 to 30.]
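As a rough numerical check of the properties above, the following Python sketch evaluates the normal density for the parameter values used in the lecture and approximates areas under the curve with a midpoint rule. The function names, the step count, and the choice to integrate over ±10 standard deviations are assumptions made for illustration only.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_curve(mu, sigma, lo, hi, steps=10_000):
    """Approximate the area under the normal density between lo and hi
    using the midpoint rule with `steps` equal-width slices."""
    width = (hi - lo) / steps
    total = sum(normal_pdf(lo + (i + 0.5) * width, mu, sigma)
                for i in range(steps))
    return total * width


if __name__ == "__main__":
    # Same mean, different standard deviations: the peak drops as sigma grows.
    for sigma in (2, 4, 6):
        print(sigma, round(normal_pdf(15, 15, sigma), 4))

    # Total area is close to 1; the area to the left of the mean is close to
    # 0.5, so the mean is also the median of this symmetric curve.
    mu, sigma = 15, 3
    print(round(area_under_curve(mu, sigma, mu - 10 * sigma, mu + 10 * sigma), 6))
    print(round(area_under_curve(mu, sigma, mu - 10 * sigma, mu), 6))
```

Changing `mu` only shifts the curve along the x-axis, while changing `sigma` changes its width and height, which matches the two panels described above.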