Methods for a Single Numeric Variable – Descriptive Statistics

In this set of notes we will look at the various measures used to summarize a single numeric variable. For convenience, we’ll label the observations of the data set x1, x2, …, xn. That is, x1 is the first measurement, x2 the second measurement, etc. Let n represent the total number of data points.

We will look at how to summarize the data with respect to the following:
  _______________ of the data
  _______________ (or variability) of the data
  Shape/_______________ of the data

Measures of Location

Mean: The arithmetic average of all the values in the data set. This quantity measures the center of the data set.

  Sample mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Note: the population mean is denoted by µ.

Median: The middle observation in a data set (after the values have been arranged in ascending or descending order). The median is the 50th percentile of the data: half the observations fall below the median and the other half above it. If the data set contains an even number of observations, the median is the average of the middle two observations. This quantity measures the center of the data set.

Quartiles:
o Q1 – The median of the lower half of the data, ________ percentile.
o Q2 – The median, ________ percentile.
o Q3 – The median of the upper half of the data, ________ percentile.

Example: From 1947 to 1971 DDT was manufactured in a plant located on Indian Creek, which flowed into the Tennessee River 321 miles from the mouth of the river. In 1972 the EPA banned the use of DDT in the United States. In the late 1970s widespread DDT contamination was discovered at the plant site and in nearby waterways. The data from a study conducted by the U.S. Army Corps of Engineers in the summer of 1980 can be found in the file Catfish.jmp on the course website.
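Before turning to the JMP output, the definitions of the mean, median, and quartiles given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `lengths` values are hypothetical (they are not taken from Catfish.jmp), the helper name `quartiles` is made up for this sketch, and the middle observation is left out of both halves when n is odd. JMP may use a different quartile convention, so its output can differ slightly for small data sets.

```python
from statistics import mean, median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves definition.

    Assumption: when n is odd, the middle observation is excluded
    from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # observations below the middle position
    upper = data[half + (n % 2):]   # observations above the middle position
    return median(lower), median(data), median(upper)


# Hypothetical fish lengths in cm (illustrative only, not from Catfish.jmp)
lengths = [30.0, 36.0, 41.0, 42.5, 45.0, 47.5, 48.0, 52.0]

x_bar = mean(lengths)            # sum of the observations divided by n
q1, q2, q3 = quartiles(lengths)  # q2 is the median

print("mean:", x_bar)
print("Q1, median, Q3:", q1, q2, q3)
```

Because n = 8 is even here, the median is the average of the two middle observations, as described above.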
The variables in the data set are given below:
  Fish ID – an identification number for each fish (1 – 44)
  Location – the location on the river from which the fish was sampled.
    FCM5 = Flint Creek 5 miles from the mouth
    LCM3 = Limestone Creek 3 miles from the mouth
    SCM1 = Spring Creek 1 mile from the mouth
    In general: TRM### = Tennessee River ### miles from the mouth
  Distance from mouth – approximate distance of the sample location from the mouth of the Tennessee River
  Species – fish species (catfish, smallmouth buffalo, largemouth bass)
  Length – length of the sampled fish (in cm)
  Weight – the weight of the sampled fish (in g)
  DDT – the concentration of DDT found in a fillet of fish (in ppm)

A portion of the data set is given below.

[Figure: portion of the Catfish.jmp data table]

We can use JMP to calculate measures of location for the variables Length, Weight, and DDT. To do so, choose Analyze > Distribution and put all three variables in the Y, Columns box as shown below.

[Figure: JMP Distribution dialog]

Click OK and JMP will return the following:

[Figure: JMP Distribution output for Length, Weight, and DDT]

Identify the following from the JMP output:
  The number of observations in the data set: _______________
  Sample mean for DDT: _______________
  Median for DDT: _______________
  Q1 for Length: _______________
  Q2 for Length: _______________
  Q3 for Length: _______________
  The smallest observation for Weight: _______________
  The largest observation for Weight: _______________

Measures of Variability or Spread

Consider the following data sets.

[Figure: Data sets A, B, and C]

Questions:
1. What is the mean for each data set? The median?

            Data set A    Data set B    Data set C
  Mean
  Median

2. Is a measure of center enough to describe these data sets? If not, what else do you think should be used?

There are several measures of variability or spread of a data set.

Range: The difference between the __________________ and __________________ measurements in the data set.

  Range = ___________________________________

Interquartile Range (IQR): The difference between the _________ and __________ quartiles.
  IQR = __________________

Average Distance from the Mean: To summarize the variability in a set of measurements, we may want to use every observation in the data set to calculate the “average distance from the mean.”

  Average Distance from Mean = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$

Calculate the average distance from the mean for Data set B from above.

  Observation    Sample Mean    Distance
  0              20
  10             20
  20             20
  30             20
  40             20
  Sum of distances
  Average distance from mean

Questions:
3. What is the problem with using this method?
4. It can be shown using a little algebra that we will always get zero for an answer. Do you have any ideas as to how to overcome this problem?

Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using absolute differences. Compute the MAD for Data set B from above.

  Observation    Sample Mean    Absolute Distance
  0              20
  10             20
  20             20
  30             20
  40             20
  Sum of distances
  MAD

Although this gives a valid measure of variability in a data set, its statistical properties are difficult to work with. Traditionally the ____________________ and __________________ are used instead.

Variance: The average _______________ distance from the mean.

  Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Compute the sample variance for Data set B.

  Observation    Sample Mean    Squared Distance
  0              20
  10             20
  20             20
  30             20
  40             20
  Sum of distances
  Sample variance

Standard Deviation: The _____________ square root of the variance.

  Sample standard deviation = $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Compute the sample standard deviation for Data set B.

Coefficient of Variation: This measures the amount of variation relative to the size of the mean.

  CV = $\frac{s}{\bar{x}} \times 100\%$

Example: We can obtain the range, sample variance, sample standard deviation, and coefficient of variation in JMP for the variables Length, Weight, and DDT. You should already have the results from selecting Analyze > Distribution open. Now, click the little red arrow next to each of the variables and choose Display Options > Customize Summary Statistics.
Then check the boxes next to Variance and CV.

[Figure: JMP summary statistics for Length, Weight, and DDT]

Identify the following from the JMP output:
  Range for Length: ___________________
  IQR for Weight: ___________________
  Variance for DDT: ___________________
  Standard deviation for DDT: ___________________
  Coefficient of variation for both Length and Weight: ____________________________________

Describing the Shape/Distribution of the Data

Determining the shape/distribution of the data is a very important step in many statistical procedures. For example, some procedures require the distribution of the data to be bell-shaped. Most often, graphical techniques are used to determine the shape of the distribution; however, a few numerical measures exist and will be discussed later.

Graphical Summaries for Shape

We can use many different types of graphical summaries to describe the shape (or distribution) of the observed data.

Comments: When plotting numeric data, the __________________ axis is a number line of values (i.e. CONTINUOUS!). The _________________ axis usually represents counts or sometimes the relative frequency of observations which have the same value.

We’ll again use the file Catfish.jmp to introduce several graphical techniques for numerical data.

Dotplot:
o __________ data point is plotted.
o Dotplots are normally used for small data sets.
o JMP does not create dotplots, but I’ve created one for the variable Length using a different software package.

[Figure: dotplot of Length]

Questions:
1. Where are most of the fish located in terms of Length?
2. Would you consider any of the fish as extreme in terms of their Length? That is, would you consider any of the fish as potential outliers? Explain.

Stem and Leaf Plots
o Again, every data point is plotted when creating a stem and leaf plot.
o Stem and Leaf plots are normally used for small data sets.
o JMP will produce a Stem and Leaf plot by clicking on Analyze > Distribution, putting pH in the Y, Columns box, and then clicking on the little red arrow next to pH in the output. Choose Stem and Leaf from the menu that appears. You should get the following plot.

[Figure: stem and leaf plot]

Comments:
o The “leaf” always represents the last digit in the values recorded.
o The “stem” represents all of the remaining digits in the values recorded.
o You’ll notice under the stem and leaf plot it says “1|8 represents 18.” This is the legend which tells what the stem and leaf units are for that particular graph. In this case the stem is the tens place and the leaf is the ones place.

Boxplot:
o The boxplot creates a picture of the data using the ______________ as reference points.
o The “box” portion is comprised of _____, _____ and _____.
o The “whiskers” represent one of two things:
    The endpoint of the lower whisker is the larger of: _____________ or _______________________
    The endpoint of the upper whisker is the smaller of: _____________ or _______________________
o Any measurement beyond the endpoint of either _________________ is classified as a potential ____________________________________ observation.
o An outlier boxplot is the other default plot produced when you choose Analyze > Distribution in JMP. The boxplot for the Length data is given below.

[Figure: outlier boxplot of Length]

Next, consider the outlier boxplot for DDT. What do you see in this plot?

[Figure: outlier boxplot of DDT]

Histograms
o This is a good type of plot when you have a lot of observations.
o The observations are placed into “bins” and the height of each bin represents the number of observations that fall into that bin.
o The histogram is one of the default plots produced when you choose Analyze > Distribution in JMP. The histogram of the Length data is given below.

[Figure: histogram of Length]

Smoothed Histograms: Changing the number of classes in a histogram may influence your perception of the shape or distribution of the data.
Therefore, it is good practice to use JMP to carry out a process called smoothing. Click on the red drop-down arrow next to Length and select Continuous Fit > Smooth Curve. JMP should return the following plot.

[Figure: histogram of Length with smooth curve]

This smooth curve represents JMP’s best guess for the shape or distribution of the ___________ from which the data are a random sample. That is, the smooth curve represents the general trends, not just the patterns that are specific to the data which were collected.

Numerical Summaries for Shape/Distribution

Two numerical summaries for shape exist: ____________________ and ___________________

Skewness: A data distribution is said to be ___________________ if it has the same shape on both sides of the center. Skewness measures the amount of ___________________.
o The distribution of a set of data is said to be __________ skewed or _______________ skewed if the measurements tend to trail off to the __________.
o Similarly, the distribution is said to be _________ skewed or ______________ skewed if the measurements tend to trail off to the __________.

  Shape           Picture    Skewness Measure in JMP
  Symmetric                  Zero
  Right Skewed               Greater than zero
  Left Skewed                Less than zero

  The most famous symmetric distribution is the normal. Others?

Kurtosis: This is used to measure the amount of _______________ of the distribution of the data relative to the normal distribution.

  Shape                                                      Picture    Kurtosis Measure in JMP
  Normal                                                                Zero
  Positive Kurtosis (taller or skinnier than normal shape)              Greater than zero
  Negative Kurtosis                                                     Less than zero

These values can be found using JMP by choosing Display Options > Customize Summary Statistics and checking Skewness and Kurtosis. You should get the following output.

[Figure: JMP skewness and kurtosis output for Length, Weight, and DDT]

Also shown are the smoothed histograms for each variable.

[Figure: smoothed histograms of Length, Weight, and DDT]

Questions:
3. Based on the histograms, how would you describe the shape/distribution for:
   a. Length: ____________________________
   b. Weight: ____________________________
   c. DDT: ____________________________
4. Do the numerical measures of skewness agree with what you see in the histograms? Explain.
5. If the data are extremely right skewed, which should be larger, the mean or median? Explain.
6. If the data are extremely left skewed, which should be larger, the mean or median? Explain.
7. If the data are symmetric, which should be larger, the mean or median? Explain.
8. Which summary statistic do you think is more representative of a typical DDT measurement, the mean or median? Explain.

Measuring the Position of an Observation

There are two commonly used methods for determining an observation’s position relative to all other measurements in the data set.

Percentiles: The ______ percentile for a set of measurements is a number such that _____ of the measurements fall at or below the pth percentile.

Z-score: This measures how many standard deviations an observation is away from the mean. Sometimes it is called the ________________________ value.

  z-score = $\frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

Example: Once again, let’s consider the Length variable from the Catfish.jmp data set.

Questions:
9. What is the 50th percentile? The 25th percentile? The 75th percentile?
10. Identify the 10th and 90th percentiles. What percent of the observations lie between these two values?

Example: On a related note, we can also create CDF plots in JMP. A CDF plot shows the estimated probability of observing a data point less than or equal to a given value. In JMP, select the red drop-down arrow next to Length and choose CDF Plot. You should get the following plot.

[Figure: CDF plot of Length]

Questions:
11. Estimate the probability of observing a randomly selected fish that is less than 45 cm in length.
12. Estimate the probability of observing a randomly selected fish that is less than 50 cm in length.

Example: To obtain z-scores for the measurements of the variables Length, Weight and DDT, select Save Standardized from the red drop-down arrow next to the variable name.
You should then see the following output in the data table.

[Figure: data table with standardized columns]

Question:
13. Using the first observation, show how the z-score for Length was calculated.

Interpretation of z-scores
  Standardizing transforms the data so that they are placed on the standardized scale.
  The standardized scale has a mean of _____ and a standard deviation of _____.
  The smallest value in the data set will always have the smallest z-score. Likewise, the largest value in the data set will have the largest z-score.
  If a z-score is _______________, then the data point is that many standard deviations below the mean.
  If a z-score is _______________, then the data point is that many standard deviations above the mean.
  If the z-score is ______________, then the data point is the same as the mean.
  If the standard deviation is ________, then the z-score is NOT defined and thus cannot be computed.

The following graphic compares the data on its original scale to the data on the standardized scale.

[Figure: data on the original scale and on the standardized scale]

Questions:
14. What changes between the two graphs?
15. Why do you think z-scores are so important?

Example: Which is more extreme: a catfish 44.5 cm long or a smallmouth buffalo 43.5 cm long?

Identification of Outliers

There are two basic methods for identifying outliers: ________________ and ________________

Boxplots: As we have already seen, these are commonly used to identify outliers. Recall that any measurement beyond the endpoint of either whisker is classified as a potential outlier (extreme observation).

Z-scores: Z-scores are also used to identify outliers. Any data value whose z-score is below -2 or above 2 is considered to be a potential outlier. Any data value whose z-score is below -3 or above 3 is considered an outlier and warrants further investigation.

Rules for Data Concentration

Once you have estimated the mean and standard deviation for a set of measurements, you can utilize a few rules to make statements about where the data are concentrated.
Empirical Rule: If the distribution of the data is _________________________ and symmetric, then the Empirical Rule applies. This rule says that APPROXIMATELY…
o ________ of the values fall within one standard deviation of the mean.
o ________ of the values fall within two standard deviations of the mean.
o ________ of the values fall within three standard deviations of the mean.

Chebyshev’s Rule: This rule works for ANY distribution. Chebyshev’s Rule tells us that AT LEAST…
o ________ of the values fall within two standard deviations of the mean.
o ________ of the values fall within three standard deviations of the mean.
o ________ of the values fall within k standard deviations of the mean.
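
The rules for data concentration can be checked numerically. The sketch below is a rough illustration, not part of the JMP workflow: it counts the fraction of observations within k sample standard deviations of the mean for Data set B (0, 10, 20, 30, 40) and compares it with the lower bound from the standard statement of Chebyshev's inequality, 1 - 1/k² for k > 1. The function names `fraction_within` and `chebyshev_lower_bound` are hypothetical helpers introduced only for this example.

```python
from statistics import mean, stdev


def fraction_within(values, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    inside = [x for x in values if abs(x - x_bar) <= k * s]
    return len(inside) / len(values)


def chebyshev_lower_bound(k):
    """Chebyshev's guaranteed minimum fraction within k standard deviations (k > 1)."""
    return 1 - 1 / k**2


# Data set B from the variability examples earlier in these notes
data_b = [0, 10, 20, 30, 40]

for k in (2, 3):
    observed = fraction_within(data_b, k)
    bound = chebyshev_lower_bound(k)
    print(f"k={k}: observed {observed:.3f}, Chebyshev guarantees at least {bound:.3f}")
```

For any data set the observed fraction should be at least as large as the Chebyshev bound, while the Empirical Rule percentages are only approximate and assume a bell-shaped, symmetric distribution.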