MEASURES OF SPREAD
1. Range: Maximum − Minimum
2. IQR (interquartile range) = Q3 − Q1 (Minimum, Q1, Median, Q3, Maximum: the five-number summary)
3. Standard deviation

PERCENTILES
- p-th percentile: the value below which (roughly) p% of the data points lie. When p = 50 we get the median.
- The 25th percentile is called the FIRST QUARTILE and is denoted by Q1.
- The 75th percentile is called the THIRD QUARTILE and is denoted by Q3.
These are useful in determining "outliers" and can be represented in a box plot. These were discussed in your recitation.

STANDARD DEVIATION
Standard deviation is a measure of the amount of deviation from the mean.

STANDARD DEVIATION: Computation
Let x1, x2, . . . , xn be a data set with mean x̄.

x-values    (x − x̄)      (x − x̄)^2
x1          (x1 − x̄)     (x1 − x̄)^2
x2          (x2 − x̄)     (x2 − x̄)^2
x3          (x3 − x̄)     (x3 − x̄)^2
. . .       . . .         . . .
xn          (xn − x̄)     (xn − x̄)^2
Sum         0             $\sum_{i=1}^{n} (x_i - \bar{x})^2$

- Population variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
- Population standard deviation: $\sigma = \sqrt{\text{Population variance}}$
- Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
- Sample standard deviation: $s = \sqrt{\text{Sample variance}}$

Why the square in (x − x̄)^2?
$\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = 0$.
Some of the deviations from the mean are positive and some are negative. When you sum them, they cancel each other out. So we square each deviation to make every term nonnegative.

Why take the square root of the variance?
This ensures that the measure of variability, the standard deviation, is in the same units as the data.

Problem 2.57
Find the range, variance and standard deviation of the data 4, 2, 1, 0, 1.
- Maximum is 4, minimum is 0, so the range is 4 − 0 = 4.
- x̄ = 8/5 = 1.6

1.
x-values    (x − x̄)      (x − x̄)^2
4           (4 − 1.6)     (4 − 1.6)^2 = 5.76
2           (2 − 1.6)     (2 − 1.6)^2 = 0.16
1           (1 − 1.6)     (1 − 1.6)^2 = 0.36
0           (0 − 1.6)     (0 − 1.6)^2 = 2.56
1           (1 − 1.6)     (1 − 1.6)^2 = 0.36
Sum         0             9.20

2. Sample variance = 9.20/4 = 2.3
3. Sample s.d. = √2.3 ≈ 1.52

Properties of Variance
- Var(x1, x2, . . . , xn) ≥ 0, and it is 0 only when x1 = x2 = . . . = xn.
- Var(x1 + b, x2 + b, . . . , xn + b) = Var(x1, x2, . . . , xn). Adding a constant does not change the variance.
- Var(ax1, ax2, . . . , axn) = a^2 Var(x1, x2, . . . , xn)
- Var(ax1 + b, ax2 + b, . . . , axn + b) = a^2 Var(x1, x2, . . . , xn)

Properties of Standard Deviation
- S.D.(x1, x2, . . . , xn) ≥ 0, and it is 0 only when x1 = x2 = . . . = xn.
- S.D.(x1 + b, x2 + b, . . . , xn + b) = S.D.(x1, x2, . . . , xn). Adding a constant does not change the standard deviation.
- S.D.(ax1, ax2, . . . , axn) = |a| S.D.(x1, x2, . . . , xn)
- S.D.(ax1 + b, ax2 + b, . . . , axn + b) = |a| S.D.(x1, x2, . . . , xn)

What can we say about the histogram if we know the mean and the s.d.?

CHEBYSHEV'S RULE
- At least $\left(1 - \frac{1}{k^2}\right)$ part of the histogram lies within k standard deviations (ks) of the mean, for any k > 1.
- k = 2: at least 75% of the observations lie within 2 standard deviations of the mean.
- k = 3: at least 8/9 (about 89%) of the observations lie within 3 standard deviations of the mean.

If the histogram is bell shaped then
- Approximately 68% of the observations lie within (x̄ − s, x̄ + s)
- Approximately 95% of the observations lie within (x̄ − 2s, x̄ + 2s)
- Approximately 99.7% of the observations lie within (x̄ − 3s, x̄ + 3s)

Problems 153, 144

Problem 2.144
If the range of a set of data is 20, find a rough approximation to the s.d. of the data set.
- At least 75% of the data (about 95% if the histogram is bell shaped) falls within (x̄ − 2s, x̄ + 2s),
- i.e. within an interval of length (x̄ + 2s) − (x̄ − 2s) = 4s,
- so 4s ≈ range = 20,
- so s ≈ 20/4 = 5.

Top Hat

Numerical measures of relative standing
Let
µ be the mean of a data set when the data set is the population,
σ be the s.d. of a data set when the data set is the population,
x̄ be the mean of a data set when the data set is a sample,
s be the s.d. of a data set when the data set is a sample.

For any value x,
the population z-score of x is $z = \frac{x - \mu}{\sigma}$
the sample z-score of x is $z = \frac{x - \bar{x}}{s}$

The z-score is a measure of "how many s.d.'s x is away from the mean".
Saying that x is more than k standard deviations away from the mean is the same as |x − µ| > kσ, i.e. if x − µ > kσ then (x − µ)/σ > k, i.e. z > k.
So "x is larger than µ + kσ" is equivalent to "the z-value of x is larger than k". Similarly, "x is smaller than µ − kσ" is equivalent to "the z-value of x is smaller than −k".

So in terms of z-values
- At least 75% of the observations have z-values less than 2 in absolute value.
- At least 8/9 (about 89%) of the observations have z-values less than 3 in absolute value.
- Put differently, at most 1/9 (about 11%) of the observations have z-values larger than 3 in absolute value.

If the histogram is bell shaped then
- Approximately 68% of the observations have z-values less than 1 in absolute value.
- Approximately 95% of the observations have z-values less than 2 in absolute value.
- Approximately 99.7% of the observations have z-values less than 3 in absolute value.

Since values that are far away from the mean have very large or very small (negative) z-scores, we can use z-scores to define "outliers". Observations with z-scores greater than 3 in absolute value are considered outliers.

Problems 139, 140, 161

Top Hat
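
The computations in these notes can be checked numerically. The following is a minimal Python sketch, not part of the course material, that reworks Problem 2.57 (data 4, 2, 1, 0, 1), computes sample z-scores, and applies the "|z| > 3" outlier rule. The function names spread_summary, z_scores and chebyshev_bound are illustrative assumptions. Only the Python standard library is used.

```python
import math
import statistics


def spread_summary(data):
    """Return the range, mean, sample variance and sample s.d. of data."""
    mean = statistics.fmean(data)
    # Sum of squared deviations from the mean (last column of the table).
    sum_sq_dev = sum((x - mean) ** 2 for x in data)
    # Divide by n - 1 because the data are treated as a sample.
    sample_var = sum_sq_dev / (len(data) - 1)
    return {
        "range": max(data) - min(data),
        "mean": mean,
        "sample_variance": sample_var,
        "sample_sd": math.sqrt(sample_var),
    }


def z_scores(data):
    """Sample z-score (x - mean) / s for every observation."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample s.d. (n - 1 in the denominator)
    return [(x - mean) / s for x in data]


def chebyshev_bound(k):
    """Chebyshev's rule: minimum fraction within k s.d.'s of the mean (k > 1)."""
    return 1 - 1 / k ** 2


data = [4, 2, 1, 0, 1]  # data from Problem 2.57
summary = spread_summary(data)
zs = z_scores(data)
# Outlier rule from the notes: z-score greater than 3 in absolute value.
outliers = [x for x, z in zip(data, zs) if abs(z) > 3]

print(summary["range"])                      # 4
print(round(summary["sample_variance"], 2))  # 2.3
print(round(summary["sample_sd"], 2))        # 1.52
print(chebyshev_bound(2))                    # 0.75
print(outliers)                              # []
```

The sample variance is computed by hand here so that the steps mirror the table in Problem 2.57; statistics.variance(data) gives the same value. For population quantities (dividing by n), statistics.pvariance and statistics.pstdev could be used instead.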