AMS7: WEEK 2. CLASS 1
Measures of Variation. Introduction to Probability
Wednesday April 8, 2015

Measures to understand Data Variation
• Range
• Standard Deviation
• Variance
• Coefficient of Variation

• RANGE: (Max. Value) - (Min. Value)
• STANDARD DEVIATION: Measure of the variation of observations about the mean.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ (SAMPLE STANDARD DEVIATION)

Measures to understand Data Variation (Cont.)
• Shortcut to the Standard Deviation Formula:
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$
Note: This formula is better when you use a calculator with the summation function.

Example: Standard deviation
• Get the standard deviation of 1, 3, 14 (waiting times in minutes):
1) Compute the mean: $\bar{x} = \frac{\sum x}{n} = \frac{18}{3} = 6$
2) Subtract the mean from each value, $(x - \bar{x})$: -5, -3, 8
3) Square each value in 2), $(x - \bar{x})^2$: 25, 9, 64
4) Sum the values in 3): $\sum (x - \bar{x})^2 = 98$
5) Divide the quantity in 4) by n - 1 = 3 - 1 = 2: 98 / 2 = 49
6) Find the square root: $s = \sqrt{49} = 7.0$ mins.

Standard Deviation of a population
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ ($\sigma$: sigma)

• VARIANCE OF A SAMPLE AND A POPULATION: The square of the standard deviation
- Sample variance: $s^2$
- Population variance: $\sigma^2$
EXAMPLE: Waiting times. Sample variance = $(7.0)^2$ = 49 min² (Units are squared!)
Note: $s^2$ is an unbiased estimator of $\sigma^2$, and as n becomes large the sample value converges to the population value.

Coefficient of Variation
Ratio of the standard deviation to the mean (normalized measure of dispersion)
$CV = \frac{s}{\bar{x}} \times 100\%$: SAMPLE VALUE
$CV = \frac{\sigma}{\mu} \times 100\%$: POPULATION VALUE
EXAMPLE: Waiting times: $CV = \frac{7}{6} \times 100\% = 116.67\%$

Rough Estimates
• STANDARD DEVIATION: $s \approx \frac{\text{Range}}{4}$
EXAMPLE: Cotinine levels (Range is known): $s \approx \frac{491 - 0}{4} \approx 123$ (Correct value is 119.5)
• Minimum and Maximum usual values:
- Min. = (mean) - 2 × (standard deviation)
- Max. = (mean) + 2 × (standard deviation)

Rules for data with a Bell-Shaped distribution
• EMPIRICAL RULE:
- About 68% of all values fall within 1 standard deviation of the mean
- About 95% of all values fall within 2 standard deviations of the mean
- About 99.7% of all values fall within 3 standard deviations of the mean

Example:
• Heights of women have a bell-shaped distribution with a mean of 163 cm and a standard deviation of 6 cm. What percentage of women have heights between 145 and 181 cm?
- 145 and 181 are 3 standard deviations (3 × 6 = 18 cm) away from the mean:
- 163 - 18 = 145
- 163 + 18 = 181
- About 99.7% of all values are within 3 standard deviations of the mean
About 99.7% of women's heights are between 145 cm and 181 cm.

Comparing values from different data sets
• Z Scores: Number of standard deviations that a value lies above or below the mean
- $z = \frac{x - \bar{x}}{s}$: SAMPLE
- $z = \frac{x - \mu}{\sigma}$: POPULATION
ORDINARY VALUES: -2 ≤ Z score ≤ 2
UNUSUAL VALUES: Z score < -2 or Z score > 2

Quartiles
• Divide the sorted data into four equal parts:
1. Q1 (First quartile): Value with 75% of observations above it and 25% below it.
2. Q2 (Second quartile): Same as the median: 50% of observations above it and 50% of observations below it.
3. Q3 (Third quartile): 75% of observations are below it and 25% above it.
EXAMPLE: Ordered data set: 1, 3, 6, 10, 15, 21, 28, 36
Q1 = 4.5, Q2 = 12.5, Q3 = 24.5

Percentiles
• The 99 percentiles divide the data into 100 groups.
• For a given value X in the data set:
Percentile of value X = $\frac{\text{number of values less than } X}{\text{total number of values}} \times 100$
• EXAMPLE: Cotinine level of smokers (40 values)
0 1 1 3 17 32 35 44 48 86
87 103 112 121 123 130 131 149 164 167
173 173 198 208 210 222 227 234 245 250
253 265 266 277 284 289 290 313 477 491
• Percentile corresponding to the level of 112: 12 values are less than 112, so $\frac{12}{40} \times 100 = 30$. 112 is the 30th percentile.

Exploratory Data Analysis (J. W. Tukey, 1977)
• Process of using statistical tools (graphs, measures of center, measures of variation) to investigate data sets in order to understand their important characteristics: 1) Center; 2) Variation and 3) Nature of the distribution.
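The measures of variation, z scores, and percentile formula from the preceding slides can be checked numerically. The following is a minimal Python sketch, not part of the original lecture; all function names are hypothetical and chosen only for illustration, and the data are the waiting times and cotinine levels quoted in the slides.

```python
import math

def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def shortcut_std(values):
    """Shortcut form: sqrt((n*sum(x^2) - (sum x)^2) / (n*(n - 1)))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

def coefficient_of_variation(values):
    """Sample CV in percent: (s / mean) * 100."""
    mean = sum(values) / len(values)
    return sample_std(values) / mean * 100

def range_rule_std(values):
    """Rough estimate of the standard deviation: range / 4."""
    return (max(values) - min(values)) / 4

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

def percentile_of(value, values):
    """Percentile of a value: (count of values below it / total) * 100."""
    below = sum(1 for x in values if x < value)
    return 100 * below / len(values)

waiting_times = [1, 3, 14]
cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

print(sample_std(waiting_times))                          # 7.0
print(shortcut_std(waiting_times))                        # 7.0
print(round(coefficient_of_variation(waiting_times), 2))  # 116.67
print(range_rule_std(cotinine))                           # 122.75
print(z_score(181, 163, 6))                               # 3.0
print(percentile_of(112, cotinine))                       # 30.0
```

Both standard deviation functions return the same value for the waiting times, which is the point of the shortcut formula: it needs only the running sums of x and x², the quantities a calculator's summation function provides.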
• OUTLIER: A value located very far away from almost all the other values. Relative to the other data, an outlier is an extreme value.

BOXPLOTS
• Very useful plot for comparing different data sets. You need to have:
- Q1: First quartile
- Q2: Second quartile
- Q3: Third quartile
- Maximum value
- Minimum value

COTININE DATA SET BOXPLOT
- Min = 0
- Q1 = 86.5
- Q2 = 170
- Q3 = 251.5
- Max = 491
[Figure: boxplot of the cotinine data set; axis in ng/ml, ticks from 0 to 500]

Outliers
• Some boxplot procedures identify extreme outliers.
Extreme outliers:
- Values less than Q1 - 3 × (Q3 - Q1) (Q3 - Q1 is the interquartile range)
- Values higher than Q3 + 3 × (Q3 - Q1)
OUTLIER EXAMPLE: Include the value 551 in the cotinine levels.
[Figure: boxplot of the cotinine data with the value 551 added; axis ticks from 0 to 500]

Boxplots and distributions
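The quartiles that feed a boxplot, and the extreme-outlier limits defined on the Outliers slide, can be reproduced with a short Python sketch. This is an illustration under a stated assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data (excluding the middle value when n is odd). That is only one of several quartile conventions; it is used here because it reproduces the values on the slides. The function names are hypothetical.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    Assumed convention: for odd n the middle value is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

def extreme_fences(values):
    """Limits for extreme outliers: Q1 - 3*IQR and Q3 + 3*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1  # interquartile range
    return q1 - 3 * iqr, q3 + 3 * iqr

def extreme_outliers(values):
    """Values that fall outside the extreme-outlier limits."""
    low, high = extreme_fences(values)
    return [x for x in values if x < low or x > high]

cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

print(quartiles([1, 3, 6, 10, 15, 21, 28, 36]))  # (4.5, 12.5, 24.5)
print(quartiles(cotinine))                       # (86.5, 170.0, 251.5)
print(extreme_fences(cotinine))                  # (-408.5, 746.5)
print(extreme_outliers(cotinine))                # []
```

Statistical packages often default to interpolation-based quartile rules, so their Q1 and Q3 may differ slightly from the values computed here.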