Chapter 4: Numerical Methods for Describing Data (How to CUSS)

4.1 Describing the Center of a Data Set

Mean – the arithmetic average (the sum of the observations divided by the number of observations). It is the balance point (fulcrum) of a set of data.

x = variable for which we have sample data
n = number of observations in the sample
x1 = first observation, x2 = second observation, …, xn = the nth (last) observation

sample mean (denoted x̄) = sum of all observations in sample / number of observations in sample = (x1 + x2 + … + xn) / n = Σx / n

population mean (denoted μ) – the average of all x values in the entire population.

Median – divides the data in half. When the data are in numerical order, it is the middle value (or the average of the middle two).

Sample median – the median of the n observations in a sample.

Trimmed mean – found by ordering the data from smallest to largest, deleting a selected number of values from each end, and averaging the remaining values.
number deleted from each end = (trimming percentage) · n

Trimming percentage – the percentage of values deleted from each end of the ordered list.
trimming percentage = (number deleted from each end / n) · 100

Comparing Mean and Median
Symmetric data – mean = median
Unimodal and positively skewed – mean greater than median
Unimodal and negatively skewed – mean less than median

Categorical data – described using relative frequencies. If there are only two categories (a dichotomy), they can be labeled S (success) and F (failure).
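The measures of center above can be sketched in a few lines of Python. This is only an illustration: the data values are made up, and `trimmed_mean` is a hand-rolled helper implementing the definition in these notes.

```python
# Measures of center from Section 4.1, using the Python standard library.
# The sample values below are hypothetical, chosen so the 10% trimmed
# mean actually removes one value from each end.
from statistics import mean, median

data = [2, 3, 3, 4, 5, 6, 6, 7, 8, 50]  # one large value pulls the mean up

def trimmed_mean(values, trim_pct):
    """Mean after deleting trim_pct (a fraction) of the ordered
    values from EACH end, per the definition in the notes."""
    ordered = sorted(values)
    k = int(trim_pct * len(ordered))  # number deleted from each end
    return mean(ordered[k:len(ordered) - k])

print(mean(data))                # 9.4  (sensitive to the outlier 50)
print(median(data))              # 5.5  (resistant to the outlier)
print(trimmed_mean(data, 0.10))  # 5.25 (drops the 2 and the 50)
```

Note how the outlier 50 drags the mean well above the median, matching the "positively skewed: mean greater than median" comparison above.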
p = sample proportion of successes – the fraction of S's in the sample = (number of S's) / n
π = used to denote the population proportion of successes

4.2 Describing Variability (Spread) in a Data Set

range = largest observation – smallest observation

deviations from the sample mean – the differences between each observation and the sample mean (x̄):
x1 - x̄, x2 - x̄, …, xn - x̄
Except for rounding, Σ(x - x̄) = 0.

sum of squared deviations (denoted Sxx) = Σ(x - x̄)²

sample variance (denoted s²) – uses squared deviations to prevent positive and negative deviations from counteracting one another:
s² = Σ(x - x̄)² / (n - 1) = Sxx / (n - 1)

sample standard deviation (denoted s) – the positive square root of the sample variance.
s and s² are based on (n - 1) degrees of freedom.

population variance – denoted σ²
population standard deviation – denoted σ

Interquartile Range (iqr) – a measure of variability that is resistant to outliers (unlike the sample standard deviation); it is based on quartiles.
iqr = upper quartile – lower quartile

lower quartile – the median of the lower half of the sample; separates the bottom 25% of the sample from the upper 75%.
upper quartile – the median of the upper half of the sample; separates the top 25% of the sample from the lower 75%.
population interquartile range – the difference between the upper and lower population quartiles.

**If the histogram of the data set is well approximated by a normal curve, then the standard deviation (sd) and interquartile range are related by:
sd ≈ iqr / 1.35
If the sd is much larger than iqr/1.35, then the histogram is heavy-tailed.

Computer printout of summary statistics (the type you will see on the AP Test):

N    Mean  Median  TrMean  StDev  SEMean  Min   Max    Q1    Q3
80   4.58  4.04    4.37    2.99   0.33    0.46  13.63  2.15  6.56

(TrMean = trimmed mean; SEMean = standard error of the mean)

4.3 Summarizing a Data Set: Boxplots

Boxplot – a method of summarizing a data set that conveys symmetry or skewness in addition to the center and spread.
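The spread measures in Section 4.2 can be checked against the standard library, which uses the same n - 1 divisor for the sample variance. A minimal sketch with hypothetical data (quartiles are computed as the medians of the lower and upper halves, the convention these notes use; software packages may use slightly different quartile rules):

```python
# Spread measures from Section 4.2; the data values are hypothetical.
from statistics import variance, stdev, median

data = [1, 3, 5, 7, 9]

xbar = sum(data) / len(data)              # sample mean: 5.0
sxx = sum((x - xbar) ** 2 for x in data)  # sum of squared deviations: 40.0
s2 = sxx / (len(data) - 1)                # sample variance: Sxx / (n - 1)
s = s2 ** 0.5                             # sample standard deviation

# Quartiles as medians of the lower and upper halves of the ordered data.
ordered = sorted(data)
half = len(ordered) // 2
iqr = median(ordered[-half:]) - median(ordered[:half])

print(s2, variance(data))  # 10.0 either way: hand formula matches stdlib
print(iqr)                 # 8.0 - 2.0 = 6.0
```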
Outlier – an observation that is more than 1.5(iqr) away from the nearest quartile (nearest end of the box).
Mild outlier – more than 1.5(iqr) but no more than 3(iqr) from the nearest quartile.
Extreme outlier – an observation that is more than 3(iqr) from the nearest quartile (nearest end of the box).

Two types of boxplots:

Skeletal boxplot – disregards the presence of any mild or extreme outliers.
1. Draw a horizontal (or vertical) measurement scale.
2. Construct a rectangular box with a left edge at the lower quartile and a right edge at the upper quartile. The box width is then equal to the iqr.
3. Draw a vertical (or horizontal) segment inside the box at the median.
4. Extend horizontal (or vertical) line segments (called whiskers) from each end of the box to the smallest and largest observations in the data set.

Five-number summary – smallest observation, lower quartile, median, upper quartile, largest observation.

Modified boxplot – makes adjustments for any mild or extreme outliers.
1. Draw a horizontal (or vertical) measurement scale.
2. Construct a rectangular box with a left edge at the lower quartile and a right edge at the upper quartile. The box width is then equal to the iqr.
3. Draw a vertical (or horizontal) segment inside the box at the median.
4. Determine whether there are any mild or extreme outliers in the data set.
5. Draw whiskers that extend from each end of the box to the most extreme observation that is not an outlier.
6. Draw a solid circle to mark the location of any mild outliers in the data set.
7. Draw an open circle to mark the location of any extreme outliers in the data set.

Comparative boxplot – two or more boxplots drawn using the same scale, to convey impressions concerning similarities and differences between data sets.

4.4 Interpreting Center and Variability

Chebyshev's Rule – allows us to get a sense of the distribution of data without a graphical display, using only the mean and standard deviation.
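The outlier classification used when drawing a modified boxplot can be sketched as a small function. The data set and quartile convention (median of each half) are illustrative, not from the notes:

```python
# Outlier rules for a modified boxplot (Section 4.3), on hypothetical data.
from statistics import median

def classify(values):
    """Return (q1, q3, mild outliers, extreme outliers) using the
    1.5*iqr / 3*iqr fences measured from the nearest quartile."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])     # lower quartile
    q3 = median(ordered[-half:])    # upper quartile
    iqr = q3 - q1
    mild, extreme = [], []
    for x in ordered:
        dist = max(q1 - x, x - q3, 0)  # distance from nearest quartile
        if dist > 3 * iqr:
            extreme.append(x)          # drawn as an open circle
        elif dist > 1.5 * iqr:
            mild.append(x)             # drawn as a solid circle
    return q1, q3, mild, extreme

print(classify([1, 2, 3, 4, 5, 6, 7, 8, 30]))
# (2.5, 7.5, [], [30]): 30 is 22.5 above q3, more than 3*iqr = 15 away
```

The whiskers would then extend only to the most extreme values not in either outlier list.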
It is applicable to any data set, whether symmetric or skewed, but we must be careful about making statements about the proportion above a particular value. Also, while the rule states that "at least" a certain percentage is within a given number of standard deviations, often there are substantially more, so the rule is rather conservative.

Consider any number k, where k > 1. Then the percentage of observations that are within k standard deviations of the mean is at least 100(1 - 1/k²)%.

Substituting selected values of k gives the following results:

# of std. dev. (k)   1 - 1/k²             % within k std. dev. of the mean
2                    1 - 1/4   = .75      at least 75%
3                    1 - 1/9   = .89      at least 89%
4                    1 - 1/16  = .94      at least 94%
4.472                1 - 1/20  = .95      at least 95%
5                    1 - 1/25  = .96      at least 96%
10                   1 - 1/100 = .99      at least 99%

Empirical Rule – more precise and less conservative than Chebyshev's Rule, and can be applied whenever the distribution of data values is reasonably well described by a normal curve.
Approximately 68% of the observations are within 1 standard deviation of the mean.
Approximately 95% of the observations are within 2 standard deviations of the mean.
Approximately 99.7% of the observations are within 3 standard deviations of the mean.

z score – a measure of the position of a particular value in a data set relative to all values in the set.
z score = (value - mean) / standard deviation
(This process is sometimes called standardization, and the result is often called a standardized score.)
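Chebyshev's bound and the z-score formula are simple enough to verify directly. A minimal sketch (the SAT-style numbers in the z-score example are made up):

```python
# Chebyshev's Rule lower bound and the z score (Section 4.4).
def chebyshev_bound(k):
    """At least this proportion of any data set lies within
    k standard deviations of the mean (valid for k > 1)."""
    return 1 - 1 / k ** 2

def z_score(value, mean, sd):
    """Standardized position of a value relative to its data set."""
    return (value - mean) / sd

for k in (2, 3, 4):
    print(k, chebyshev_bound(k))  # 0.75, 0.888..., 0.9375

# Contrast with the Empirical Rule for normal-shaped data: roughly
# 0.68, 0.95, 0.997 within 1, 2, 3 standard deviations -- well above
# Chebyshev's guarantees, which is why Chebyshev is "conservative".

# Hypothetical example: a score of 650 when the mean is 500, sd is 100.
print(z_score(650, 500, 100))  # 1.5 standard deviations above the mean
```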
4.5 Interpreting and Communicating the Results of Statistical Analysis

Reporting the results of data analysis
- Start with descriptive information about the variables of interest.
- Make a graphical display of the data.
- Describe the center, shape, and spread of the data using appropriate measures:
  If the data are fairly symmetric
    o Sample mean and standard deviation
  If the data are noticeably skewed
    o Sample median and interquartile range
    o Five-number summary

What to look for in published data – questions to ask yourself
- Is the chosen summary measure appropriate for the type of data collected? In particular, watch for inappropriate use of the mean and standard deviation with categorical data that has been coded numerically.
- How do the mean and median compare? If only one is given, was the appropriate measure used?
- Is the standard deviation large or small, and what does it tell you about the variable being summarized?
- Can anything be said about the values by applying Chebyshev's Rule or the Empirical Rule?

Cautions and Limitations
- Measures of center don't tell all. They must be used with information about variability and shape; data distributions with different shapes can have the same mean and standard deviation.
- Both the mean and the standard deviation are sensitive to extreme values, especially if the data set is small.
- Measures of center and variability describe the values of the variable studied, not the frequencies.
- Be careful when interpreting boxplots based on small sample sizes; symmetry or skewness cannot be determined from a small sample.
- Not all distributions are normal. Be cautious about using the Empirical Rule when the distribution is not approximately normal.
- Watch out for outliers!