Descriptive Statistics
August 27, 2012

Overview of Descriptive Statistics
* Descriptive statistics are used to describe the basic features of the data gathered from an experimental study.
* They provide simple summaries about the sample and the measures.
* Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data.

Basic Features of Data
* The size of the sample is usually denoted $n$.
* The measures of central tendency are the mean, median, and mode.
* The measures of spread or variation are the min, max, variance and standard deviation, the fourths spread $f_s$ (IQR), and outliers.
* The measures of position are percentiles, deciles, and fourths (quartiles).

Measures of the Center of the Data

Overview
* What is normal for this population?
* We try to understand this question by asking another question.
* What is the central value of the data?

$\mu$ vs $\bar{x}$
* The mean $\mu$ is the average value of a population.
* The sample mean $\bar{x}$ is the average value of the sample:
  \[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
  where $n$ is the size of the sample $x_1, x_2, \ldots, x_n$.

The Median $\tilde{x}$ (or simply $M$)
* $\tilde{x}$ denotes the median, which is a number that splits the ranked data into two parts of equal size.
* If the number of data is odd, then $\tilde{x}$ is the central number.
* If the number of data is even, then $\tilde{x}$ is the midpoint of the two central numbers.

The Median $\tilde{x}$
* e.g. For 3; 3; 4; 5; 7; 8; 10 the median is $\tilde{x} = 5$.
* e.g. For 2.1; 4.2; 4.3; 4.9; 5.0; 5.2 the median is $\tilde{x} = \frac{4.3 + 4.9}{2} = 4.6$.

Mode
* The mode is the most common value in a sample.
* e.g. the mode is 7 for $S = \{3, 5, 7, 7, 7, 9\}$.

Compare the various measures of the "center"
Consider the sample 100; 100; 100; 100; 500; 600; 600; 700; 800
* The mode is 100.
* The median is $\tilde{x} = 500$.
* The sample mean is
  \[ \bar{x} = \frac{100 + 100 + 100 + 100 + 500 + 600 + 600 + 700 + 800}{9} = 400 \]

Measuring the Spread of Data

Limitations of max and min
Consider the following data sets. Both have the same min (1) and max (9), yet their values are spread very differently.
* 1; 5; 5; 5; 5; 5; 5; 5; 9
* 1; 2; 3; 4; 5; 6; 7; 8; 9

Deviation
How far away from the average is a given piece of data?
* We could just take the absolute value...
* 1; 2; 3; 4; 5; 6; 7; 8; 9
* $\mu = 5$, $x_3 = 3$, $|\mu - x_3| = 2$...
* so $x_3$ is 2 away from the average.
* Similarly $x_8 = 8$ is 3 from the average.
* These examples illustrate the absolute deviations of $x_3$ and $x_8$ from the mean.

Standard Deviation
* The standard deviation gives a typical distance of the data from the mean, so it provides a scale for judging how far away a data point is relative to the other data.

Calculating the Standard Deviation of a sample
* We first calculate the sample variance
  \[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]
* The sample standard deviation is the square root of the sample variance
  \[ s = \sqrt{s^2} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} \]

Using the Standard Deviation
1; 2; 3; 4; 5; 6; 7; 8; 9
* Which data are at least 1 standard deviation from the mean?
* We want those data $x_i$ such that $|x_i - \bar{x}| \ge s$.
* $|x_i - 5| \ge 2.74$ for $x_1 = 1$, $x_2 = 2$, $x_8 = 8$ and $x_9 = 9$.

Calculating the Standard Deviation of a sample
* We first calculate the sample variance
  \[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]
  \[ = \frac{(1-5)^2 + (2-5)^2 + (3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2 + (8-5)^2 + (9-5)^2}{9 - 1} = 7.50 \]
* The sample standard deviation is the square root of the sample variance
  \[ s = \sqrt{s^2} = \sqrt{7.50} \approx 2.74 \]

Measures of Position

Percentiles
* The percentile of an observation is the percent of the data less than (or equal to) that observation.
* The median is the 50th percentile.
* The lower fourth (first quartile) $Q_1$ is the 25th percentile.
* The upper fourth (third quartile) $Q_3$ is the 75th percentile.

Calculating Percentiles
* The $p$th percentile for a ranked data set consisting of $n$ observations is found by a two step procedure.
* Compute the index
  \[ i = \frac{p}{100}\, n \]
* If $i$ is not an integer, the next integer greater than $i$ locates the position of the $p$th percentile in the ranked data set. If $i$ is an integer, the $p$th percentile is the average of the observations in positions $i$ and $i + 1$ in the ranked data set.
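The two step procedure above can be sketched in Python as follows. This is only an illustrative sketch: the function name `percentile` is hypothetical, and it assumes the list is already ranked and that `p` is a whole number strictly between 0 and 100.

```python
def percentile(ranked, p):
    """Return the p-th percentile of an already ranked (sorted) list.

    Assumption for this sketch: p is an integer with 0 < p < 100.
    """
    n = len(ranked)
    # i = (p / 100) * n, kept as whole part and remainder to avoid float rounding.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is an integer: average the observations in positions i and i + 1
        # (positions are 1-based, list indices are 0-based).
        return (ranked[whole - 1] + ranked[whole]) / 2
    # i is not an integer: the next integer above i gives the position.
    return ranked[whole]


data = [312, 318, 320, 331, 342, 344, 345, 349, 350, 390]
p25 = percentile(data, 25)  # i = 2.5, so position 3: 320
p50 = percentile(data, 50)  # i = 5, so average of positions 5 and 6: 343.0
```

Statistical software often uses a slightly different interpolation rule, so library results may not match this procedure exactly.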
Percentiles: an example
* 312; 318; 320; 331; 342; 344; 345; 349; 350; 390
* Lower fourth (first quartile): $Q_1$ = median of bottom half = 320
* Median: $\tilde{x} = \frac{342 + 344}{2} = 343$
* Upper fourth (third quartile): $Q_3$ = median of top half = 349
* $p_{25} = 320$
* NOTE: $p_{25} = Q_1$ (the data in the third position)

Deciles: an example
* 312; 318; 320; 331; 342; 344; 345; 349; 350; 390
* The 1st decile is the 10th percentile: $\frac{312 + 318}{2} = 315$
* The 8th decile is the 80th percentile: $\frac{349 + 350}{2} = 349.5$

An Exercise
* 34; 35; 36; 38; 40; 45; 51; 63
* Find the upper and lower fourths (first and third quartiles) and the median.
* Find the 90th percentile.

$f_s$: Fourths Spread (or IQR: interquartile range)
\[ f_s = Q_3 - Q_1 \]
* The fourths spread (interquartile range) is the length of the interval containing the middle 50% of the data.
* We will also see that this is the width of the box in a box and whisker plot.
* This is another way to measure the spread of the data and it is also (almost) a way to measure the middle.
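In the percentile example, the fourths were found as the medians of the bottom and top halves of the ranked data. A minimal Python sketch of that approach and of the fourths spread is given here. The names `fourths` and `fourths_spread` are hypothetical, and for an odd sample size the sketch leaves the middle value out of both halves, which is one common convention assumed here rather than something the slides specify.

```python
from statistics import median


def fourths(ranked):
    """Return (Q1, Q3) as the medians of the bottom and top halves.

    Assumptions for this sketch: the list is already ranked, has at least
    two values, and for odd n the middle value belongs to neither half.
    """
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]
    upper = ranked[half:] if n % 2 == 0 else ranked[half + 1:]
    return median(lower), median(upper)


def fourths_spread(ranked):
    """Return f_s = Q3 - Q1."""
    q1, q3 = fourths(ranked)
    return q3 - q1


data = [312, 318, 320, 331, 342, 344, 345, 349, 350, 390]
q1, q3 = fourths(data)     # 320 and 349, as in the example
fs = fourths_spread(data)  # 349 - 320 = 29
```

The halves method and the index procedure for percentiles can give different fourths for some samples, so it is worth stating which one is being used.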
IQR: an example
\[ \mathrm{IQR} = Q_3 - Q_1 \]
* 1; 2; 3; 4; 5; 6; 7; 8; 9
* $Q_1$: $i = \frac{25}{100} \cdot 9 = 2.25$, which is not an integer, so $Q_1$ is the datum in position 3: $Q_1 = 3$
* Similarly $Q_3 = 7$
* $\mathrm{IQR} = 7 - 3 = 4$

Outliers
* An outlier is a datum that does not fit the rest of the data.
* A common way to determine which data are outliers is to say that an outlier is a datum that is more than 1.5(IQR) outside the middle 50%.
* e.g. $1.5(\mathrm{IQR}) = 1.5(4) = 6$. So, an outlier would have to be below $Q_1 - 6 = 3 - 6 = -3$ or above $Q_3 + 6 = 7 + 6 = 13$. We see that there are no outliers in this sample.

Box and Whisker Plots
* A Box and Whisker Plot is a visual representation of data that focuses on the fourths (quartiles).
* To construct a box plot, use a horizontal number line and a rectangular box. The smallest and largest data values label the endpoints of the axis.
* The lower fourth (first quartile) marks one end of the box and the upper fourth (third quartile) marks the other end of the box.
* The middle fifty percent of the data fall inside the box.
* The "whiskers" extend from the ends of the box to the smallest and largest data values.
* The box plot gives a good quick picture of the data.
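The 1.5(IQR) rule from the Outliers slide can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names `outlier_fences` and `find_outliers` are hypothetical, and $Q_1$ and $Q_3$ are passed in as already computed values rather than derived inside the code.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (low, high): data below low or above high count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3, k=1.5):
    """Return the data lying more than k * IQR outside the middle 50%."""
    low, high = outlier_fences(q1, q3, k)
    return [x for x in data if x < low or x > high]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, high = outlier_fences(3, 7)       # -3.0 and 13.0, as in the example
outliers = find_outliers(sample, 3, 7)  # [] since nothing lies outside the fences
```

The multiplier is kept as a parameter `k` only for illustration; the slides use 1.5.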
Box and Whisker Plots: An Example
Create a box plot of the following data
312; 318; 320; 331; 342; 344; 345; 349; 350; 390

Histograms
A histogram is made by grouping data into bins and plotting the frequency or relative frequency of members in each bin versus the bin values.

Histograms: An Example
Create a histogram of the following data
312; 318; 320; 331; 342; 344; 345; 349; 350; 390

Box and Whisker Plots: An Example
-1.96; -0.814; 1.86; 1.96; 0.519; 0.739; -0.540; 0.702; 0.663; 0.591; 0.580; 0.475; 0.589; -1.33; 0.420; -0.460; -0.482; 1.58; 0.778; 0.530; -0.507; -0.233; -0.195; 0.193; -0.136

Histogram: An Example
e.g. -1.96; -0.814; 1.86; 1.96; 0.519; 0.739; -0.540; 0.702; 0.663; 0.591; 0.580; 0.475; 0.589; -1.33; 0.420; -0.460; -0.482; 1.58; 0.778; 0.530; -0.507; -0.233; -0.195; 0.193; -0.136

Comparing the Visuals

Practice
Draw the box plot and histogram for the sample
1; 1.1; 1.2; 2; 2.2; 3; 4.1; 5; 7; 7.1; 10

Skewness
* In our last two samples, the mean and median were the same.
* This is not always the case.
* 1; 1; 2; 3; 5; 7; 10
* The median is $\tilde{x} = 3$.
* The sample mean is
  \[ \bar{x} = \frac{1 + 1 + 2 + 3 + 5 + 7 + 10}{7} \approx 4.14 \]

Skewness
* If the mean exceeds the median then the sample is skewed to the right.
* If the median exceeds the mean then the sample is skewed to the left.
* $\bar{x} > M \Rightarrow$ skewed to the right; $\bar{x} < M \Rightarrow$ skewed to the left
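The mean versus median comparison above is easy to automate. The following Python sketch uses the standard `statistics` module; the function name `skew_direction` and the label returned when the two measures coincide are assumptions made for illustration only.

```python
from statistics import mean, median


def skew_direction(sample):
    """Compare the sample mean with the median, following the rule above."""
    m, med = mean(sample), median(sample)
    if m > med:
        return "right"
    if m < med:
        return "left"
    # Mean equals median: the rule indicates no skew in either direction.
    return "none indicated"


direction = skew_direction([1, 1, 2, 3, 5, 7, 10])  # mean is about 4.14, median is 3: "right"
```

This rule is a quick rule of thumb based only on the two measures of center, not a full measure of skewness.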