Chapter 3

Sec 3.2 Measures of Central Tendency, Variation, and Shape

Summation Notation

Ex 1. $\sum_{n=1}^{10} 3n$

Ex 2. $\sum_{n=2}^{6} 2n + 3$

Ex 3. $\sum_{n=1}^{5} X_n = X_1 + X_2 + X_3 + X_4 + X_5$

The Arithmetic Mean

Arithmetic Mean (or just Mean): used to measure central tendency. It is easily thrown off by extreme values (outliers).

Notation: $\bar{X}$ represents the mean of a set of values. $X_1, X_2, X_3, X_4, X_5$ represent individual values. The mean of these 5 values is computed as follows:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + X_4 + X_5}{5}$$

In general,

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$$

The Median
a. The median is the middle value in an ordered array of data.
b. Not affected by outliers (extreme values).
c. Can be used in place of the Mean when outliers exist.

N - Odd
If the number of observations is odd, the median is the value in position $\frac{n+1}{2}$.

Ex 4. 2 4 6 9 10
Median is 6 (the value in position (5 + 1)/2 = 3)

N - Even
If the number of observations is even, the median is the average of the two middle observations.

Ex 5. 2 4 6 9 10 11
Median is $\frac{6 + 9}{2} = 7.5$

Note: Both the Mean and Median are measures of central tendency.

The Mode
The mode is the value in the set of observations that occurs most frequently.
a. Not affected by outliers (extreme values).
b. Used mainly for descriptive purposes, since this value could vary from sample to sample.

Ex 6. 5 6 6 8 9 10 10 10 11 13
Mode is 10

Quartiles
First Quartile: Q1 is in position $\frac{n+1}{4}$
Third Quartile: Q3 is in position $\frac{3(n+1)}{4}$

If the position is halfway between two integers, use the average of the values in those two positions.
If the position is neither an integer nor halfway between two integers, round it to the nearest integer.

Ex 7. 2.4 3.6 4.7 5.8 6.7 6.8 6.9 7.1 8.4 (nine observations)
Q1 is in position $\frac{9 + 1}{4} = \frac{10}{4} = 2.5$
Q1 = average of the 2nd and 3rd values
Q1 = $\frac{3.6 + 4.7}{2} = 4.15$

Ex 8. 2.3 2.4 3.5 3.6 5.6 6.2 6.9 8.0 (eight observations)
Q3 is in position $\frac{3(8 + 1)}{4} = \frac{27}{4} = 6.75$, which rounds to 7
Q3 = 6.9 (the seventh observation)

Measures of Variation
See pg. 117
Measure of how spread out the data are.

Five measures of variation
1. Range
2. Interquartile Range
3. Variance
4. Standard Deviation
5. Coefficient of Variation

Range
a. A very simple measure of variation
b. Doesn't take into account the data between the largest and smallest values
c. Measure of total spread (using end values)

Range = largest value - smallest value

Interquartile Range
a. Difference between the third and first quartiles
b. Not affected by outliers
c. Measure of middle spread

Interquartile Range = Q3 - Q1

Ex 9. 4 5 6 8 9 12 13 15 15 17 (ten observations)
Q1 is in position $\frac{10 + 1}{4} = \frac{11}{4} = 2.75$ (approx. 3)
Q1 = 6 (the number in the third position)
Q3 is in position $\frac{3(10 + 1)}{4} = \frac{33}{4} = 8.25$ (approx. 8)
Q3 = 15 (the number in the eighth position)
Interquartile Range is Q3 - Q1 = 15 - 6 = 9

Sample Variance
See pg. 119 Exhibit 3.1
a. Takes into account all data.
b. Shows how a set of data is distributed around the mean.
c. Measure of the average scatter around the mean.
d. The result is a number in squared units of the original data. (If the data is in inches, the Variance is in inches squared.)
e. Variance is denoted by $S^2$

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Ex 10. Consider the set of values that represent the age at which a sample of 6 people graduated from college: 18 22 22 23 24 26

$X_i$    $\bar{X}$    $(X_i - \bar{X})$     $(X_i - \bar{X})^2$
18       22           (18 - 22) = -4        (-4)^2 = 16
22       22           (22 - 22) = 0         (0)^2 = 0
22       22           (22 - 22) = 0         (0)^2 = 0
23       22           (23 - 22) = 1         (1)^2 = 1
24       22           (24 - 22) = 2         (2)^2 = 4
26       22           (26 - 22) = 4         (4)^2 = 16

Squaring makes each deviation non-negative.

$$S^2 = \frac{16 + 0 + 0 + 1 + 4 + 16}{6 - 1} = \frac{37}{5} = 7.4$$ (note: in units squared)

Note: the exact mean of these six values is 135/6 = 22.5; the example rounds it to $\bar{X} = 22$. With $\bar{X} = 22.5$ the sum of squared deviations is 35.5 and $S^2 = 35.5/5 = 7.1$.

Sample Standard Deviation
a. Takes into account all data.
b. Primary measure of variation.
c. Shows how a set of data is distributed around the mean.
d. Measure of the average scatter around the mean.
e. The Standard Deviation is the square root of the sample variance.
f. This results in a number that has the same units as the individual values.
g. Standard Deviation is denoted by S

$$S = \sqrt{S^2}$$

Ex 11. Using the sample above,
$S = \sqrt{7.4} = 2.720294101$, approx. 2.72 (same units as the original values)

Which of the following will have a Standard Deviation of 0?
Between ex a. and ex c., which one will have the largest Standard Deviation?
ex a. 1 6 7 11 21 21 35
ex b. 3 3 3 3 3 3 3
ex c. 4 4 5 8 10 11 11

Pg. 121
For most sets of data, the majority of the values are within one Standard Deviation (1*S) of the mean, that is, between $\bar{X} - 1S$ and $\bar{X} + 1S$.
For example, using Ex 10, the majority of the values lie between 22 - 2.72 and 22 + 2.72.

Coefficient of Variation (CV)
1. Measures the scatter in the data relative to the mean.
2. Relative measure of variation, expressed as a percentage.

$$CV = \left(\frac{S}{\bar{X}}\right) 100\%$$

As S increases, CV increases.
As $\bar{X}$ increases, CV decreases.

The CV is useful when comparing two sets of data that are measured in different units. (SD / Mean: the units cancel.)

Shape of a set of data
Asymmetrical data: not symmetrical, skewed left or right.

Skewed Left (Negative Skew)
Mean < Median
Extreme low values throw the mean off (decrease the mean).

Skewed Right (Positive Skew)
Mean > Median
Extreme high values throw the mean off (increase the mean).

Symmetrical (Not Skewed)
Mean = Median
No extreme values. Low and high values balance each other.

Example data (33 values):
20 22 22 24 24 24 26 27 32 34 35
35 36 45 45 46 46 54 54 54 54 56
56 64 67 67 76 76 80 89 90 99 100
Mean = 50.87879
Median = 46
Here Mean > Median, which indicates a right (positive) skew.

Sec 3.3 Exploratory Data Analysis

The 5-Number summary
$X_{smallest}$   Q1   Median   Q3   $X_{largest}$

Right-Skewed Distributions: distance from median to $X_{largest}$ > distance from $X_{smallest}$ to median
Right-Skewed Distributions: distance from Q3 to $X_{largest}$ > distance from $X_{smallest}$ to Q1
Left-Skewed Distributions: distance from $X_{smallest}$ to median > distance from median to $X_{largest}$
Left-Skewed Distributions: distance from $X_{smallest}$ to Q1 > distance from Q3 to $X_{largest}$
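The quartile, variance, and five-number-summary rules above can be checked with a short script. This is a minimal Python sketch, not part of the original notes: the function names (`value_at_position`, `quartiles`, `sample_variance`, `five_number_summary`) are made up for illustration, and the quartile code follows the position-rounding rule stated in these notes, which is only one of several quartile conventions, so library routines may return different values. It assumes at least two observations.

```python
import math
import statistics


def value_at_position(sorted_values, position):
    """Value at a 1-based position, using the rounding rule from the notes."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    if fraction == 0.5:
        # Halfway between two positions: average the two neighbouring values.
        return (sorted_values[lower - 1] + sorted_values[lower]) / 2
    # Otherwise round the position to the nearest integer.
    return sorted_values[math.floor(position + 0.5) - 1]


def quartiles(values):
    """Q1 and Q3 taken from positions (n + 1)/4 and 3(n + 1)/4."""
    data = sorted(values)
    n = len(data)
    return (value_at_position(data, (n + 1) / 4),
            value_at_position(data, 3 * (n + 1) / 4))


def sample_variance(values, mean=None):
    """Sum of squared deviations from the mean, divided by n - 1."""
    if mean is None:
        mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


def five_number_summary(values):
    """(smallest, Q1, median, Q3, largest)."""
    q1, q3 = quartiles(values)
    return (min(values), q1, statistics.median(values), q3, max(values))


ex7 = [2.4, 3.6, 4.7, 5.8, 6.7, 6.8, 6.9, 7.1, 8.4]
ex9 = [4, 5, 6, 8, 9, 12, 13, 15, 15, 17]
ages = [18, 22, 22, 23, 24, 26]  # Ex 10

q1_ex7, _ = quartiles(ex7)        # Q1 position 2.5: average of 3.6 and 4.7
q1_ex9, q3_ex9 = quartiles(ex9)   # positions 2.75 and 8.25 round to 3 and 8
iqr_ex9 = q3_ex9 - q1_ex9         # interquartile range

s2_notes = sample_variance(ages, mean=22)  # mean as given in Ex 10
s_notes = math.sqrt(s2_notes)              # sample standard deviation
cv_notes = s_notes / 22 * 100              # coefficient of variation, in percent

print(q1_ex7, q1_ex9, q3_ex9, iqr_ex9)
print(s2_notes, round(s_notes, 2), round(cv_notes, 2))
print(five_number_summary(ex9))
```

Passing `mean=22` reproduces the 7.4 of Ex 10; calling `sample_variance(ages)` without it uses the exact mean of 22.5 and gives 7.1 instead.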
Recall:
Q1 = observation in position $\frac{n+1}{4}$
Q3 = observation in position $\frac{3(n+1)}{4}$

3.3 continued

Box-and-Whisker Plot (uses the 5-number summary)

Five-number Summary
Minimum
First Quartile
Median
Third Quartile
Maximum

[Plot: box-and-whisker plots for a distribution skewed left and a distribution skewed right]

Example five-number summary:
Min   Q1   Median   Q3   Max
50    77   85       89   100

Sec 3.4

Recall:
$\bar{X}$ represents the sample mean
$S^2$ represents the sample variance
S represents the sample standard deviation

If the data set represents an entire population instead of just a sample:

Population Mean
$\mu$ represents the mean of the population (read as mu).
N represents the number of observations.
$X_i$ represents the ith individual observation.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Population Variance
$\sigma^2$ (lowercase letter sigma, read as "Sigma Squared")

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Population Standard Deviation (square root of the variance)

$$\sigma = \sqrt{\sigma^2}$$

Example (the sample from Ex 10)
Values: 18 22 22 23 24 26, with $\bar{X} = 22$
Squared deviations $(X_i - \bar{X})^2$: 16, 0, 0, 1, 4, 16
$S^2 = \frac{16 + 0 + 0 + 1 + 4 + 16}{6 - 1} = \frac{37}{5} = 7.4$ (note: in units squared)
$S = \sqrt{7.4} = 2.720294101$, approx. 2.72

Empirical Rule (not to be used with data sets that are highly skewed)
1. Approx. 68% (about 2/3) of the observations lie within +- 1 S.D. of the mean.
   About 68% lie between 22 - 2.72 and 22 + 2.72, that is, between 19.28 and 24.72.
2. Approx. 95% of the observations lie within +- 2 S.D. of the mean.
   About 95% lie between 22 - 2*2.72 and 22 + 2*2.72, that is, between 16.56 and 27.44.

Sec 3.5 Coefficient of Correlation

Error, Pg. 138: the text states "In section 2.5, scatter diagrams are used to ...". Scatter diagrams were discussed in Section 2.3.

Recall: Scatter Diagram: graphically displays bivariate (two-variable) numerical data.

Coefficient of correlation (r): a numerical measure of the strength of the relationship between two variables.
-- Measures the degree of linear association between two variables.
-- Values range from -1 (perfect negative correlation) to 1 (perfect positive correlation).
-- Perfect means that all points could be connected with a straight line.
-- The relationship between 2 variables is described as a "tendency" and not as a "cause & effect".
-- Correlation alone cannot prove that the change in one variable caused the change in the other variable.
-- Further analysis is needed to prove causation.
-- Causation implies correlation, but correlation does not imply causation.

The Coefficient of Correlation is computed as follows:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

[Scatter diagram: Column B (vertical axis) against Column A (horizontal axis)]

$X_i$   $Y_i$   $X_i - \bar{X}$     $Y_i - \bar{Y}$        $(X_i - \bar{X})^2$   $(Y_i - \bar{Y})^2$
3       20      3 - 7.5 = -4.5      20 - 14.75 = 5.25      20.25                 27.5625
7       15      7 - 7.5 = -0.5      15 - 14.75 = 0.25      0.25                  0.0625
9       14      9 - 7.5 = 1.5       14 - 14.75 = -0.75     2.25                  0.5625
11      10      11 - 7.5 = 3.5      10 - 14.75 = -4.75     12.25                 22.5625

$\bar{X}$ = (3 + 7 + 9 + 11)/4 = 7.5
$\bar{Y}$ = (20 + 15 + 14 + 10)/4 = 14.75

Positive Correlation. As one variable increases, the other variable tends to increase.
Negative Correlation. As one variable increases, the other variable tends to decrease.
Zero Correlation. No linear relationship: as one variable changes, the other shows no consistent tendency to increase or decrease.

[Scatter diagram: Column B against Column A]

Coefficient of correlation, Pg. 140
Explanations for a correlation between two variables:
-- Caused by chance
-- A third variable (not included in the study) may be the cause
-- Cause and effect

Sec 3.6 Pitfalls in numerical descriptive measures and ethical issues
Read, Read, Read
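The coefficient of correlation from Sec 3.5 can be computed directly from its definition. The sketch below is an illustration in Python rather than part of the original notes; the function name `correlation` is hypothetical, and the data are the four (X, Y) pairs from the Sec 3.5 table.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation r, computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the products of paired deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [3, 7, 9, 11]
ys = [20, 15, 14, 10]
r = correlation(xs, ys)
print(round(r, 3))
```

For these four pairs the numerator is -41.5 and the two sums of squares are 35 and 50.75, so r is about -0.985: a strong negative correlation, matching the downward trend in the table (Y falls as X rises).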