
Methods for a Single Numeric Variable – Descriptive Statistics

So far this semester we’ve been concentrating on categorical variables/data. We are now going to discuss how to summarize and describe numeric data, as well as inferential procedures.

Measures of Center

We can use the following methods to describe the center of the distribution of a given data set.

• Mean: The arithmetic mean of the observations.
  Sample mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: The middle number in a data set after the numbers have been arranged in ascending (or descending) order. This is the value which cuts off the 50th percentile of the observations, i.e. 50% of the data values lie above the median and 50% of the data values lie below the median.

Example: Two researchers measured the pH (a scale on which a value of 7 is neutral and values below 7 are acidic) of water collected from rain and snow over a 6-month period in Allegheny County, PA. The data can be found on the course website in the file pH.jmp. We can use JMP to find the mean and median of the data.

1. Click on Analyze → Distribution. Then place pH in the Y, Columns box as shown below.
2. You should then get the following output:

The mean and median are circled in the above output.

In addition to the mean and median, quartiles (or percentiles or quantiles) give additional information regarding the distribution of the data.

• Q1: The first quartile, which represents the _______ percentile. This is also the median of the lower half of the data.
• Q2: The second quartile, which is the _______________.
• Q3: The third quartile, which represents the _______ percentile. This is also the median of the upper half of the data.

Questions:

1. Referring to the JMP output above, identify the values for Q1 and Q3.
   Q1 = ________    Q3 = ________
   Note: There are a number of other percentiles listed in the JMP output.
2. What values do the 0th and 100th percentiles represent?
   0th percentile = __________    100th percentile = __________
3. 
Together the ________, ________, ________, ________, and ________ form what’s called the Five Number Summary. The Five Number Summary provides a numeric “picture” of the distribution of the data.
4. What percent of the observations should fall between the 25th and 75th percentiles?
5. What about the 2.5th and 97.5th percentiles?
6. How about the 0.5th and 99.5th percentiles?

Measures of Variability or Spread

Consider the following data sets.

Questions:

7. What is the mean for each data set? The median?

             Data set A    Data set B    Data set C
   Mean
   Median

8. Is a measure of center enough to describe these data sets? If not, what else do you think should be used?

There are several measures of variability or spread of a data set.

• Range: The difference between the __________________ and __________________ measurements in the data set.
  Range = ___________________________________
• Interquartile Range (IQR): The difference between the _________ and __________ quartiles.
  IQR = __________________
• Average Distance from the Mean: To summarize the variability in a set of measurements, we may want to use every observation in the data set to calculate the “average distance from the mean.”
  Average Distance from Mean = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$

  Calculate the average distance from the mean for Data set B from above.

  Observation                   Sample Mean    Distance
  0                             20
  10                            20
  20                            20
  30                            20
  40                            20
  Sum of distances
  Average distance from mean

Questions:

9. What is the problem with using this method?
10. It can be shown using a little algebra that we will always get zero for an answer. Do you have any ideas as to how to overcome this problem?

• Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using absolute difference.

  Compute the MAD for Data set B from above.

  Observation                   Sample Mean    Absolute Distance
  0                             20
  10                            20
  20                            20
  30                            20
  40                            20
  Sum of distances
  MAD

Although this gives a valid measure of variability in a data set, this quantity has difficult statistical properties.
Traditionally the ____________________ and __________________ are used instead.

• Variance: The average _______________ distance from the mean.
  Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

  Compute the sample variance for Data set B.

  Observation                   Sample Mean    Squared Distance
  0                             20
  10                            20
  20                            20
  30                            20
  40                            20
  Sum of distances
  Sample variance

• Standard Deviation: The _____________ square root of the variance.
  Sample standard deviation = $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

  Compute the sample standard deviation for Data set B.

Example: Looking again at the pH data, compute the range, IQR, sample variance, and sample standard deviation using the JMP output.

  Range    IQR    Sample standard deviation ($s$)    Sample variance ($s^2$)

Example: Download the file messages.jmp from the course website. This data set contains the number of text messages an individual sends in a day, collected from the student data survey you may have completed at the beginning of the semester. Answer the following questions regarding the data set.

Questions:

11. Compute the average number of text messages sent in a day.
12. Find the Five Number Summary for the number of text messages sent in a day.
13. Compute the range for the number of text messages sent in a day.
14. Between what two values does the middle 50% of text messages sent in a day lie?
15. Compute the sample standard deviation and sample variance for the number of text messages sent in a day.

We’ve been looking at numerical summaries used to describe a single numeric variable. We will now look at the various methods to graphically summarize these types of variables.

Description of Shape

We can use many different types of graphical summaries to describe the shape (or distribution) of the observed data.

Comments:

• When plotting numeric data, the __________________ axis is a number line of values (i.e. CONTINUOUS!).
• The _________________ axis usually represents counts or sometimes the relative frequency of observations which have the same value.
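The two quantities mentioned in the second comment, counts and relative frequencies, can be sketched in a few lines of Python. This is only an illustration: the course itself uses JMP, the `values` list is made up (it is not one of the course data sets), and the helper name `relative_frequencies` is hypothetical.

```python
from collections import Counter

# Hypothetical measurements, made up for illustration (not course data).
values = [4.1, 4.3, 4.3, 4.5, 4.5, 4.5, 4.7]

# Count how many observations share each value.
counts = Counter(values)
n = len(values)


def relative_frequencies(counts, n):
    """Convert counts to relative frequencies (count divided by sample size)."""
    return {value: count / n for value, count in counts.items()}


rel_freq = relative_frequencies(counts, n)

# Print one row per distinct value, in increasing order along the number line.
for value in sorted(counts):
    print(f"{value}: count = {counts[value]}, "
          f"relative frequency = {rel_freq[value]:.3f}")
```

Counts and relative frequencies give plots with the same shape; only the scale of the axis changes, since every count is divided by the same sample size.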
We will again use the file pH.jmp to discuss the various graphical techniques for describing the shape (or distribution) of the observed data.

• Dotplot
  o _________ data point is plotted when creating a dotplot.
  o Dotplots are normally used for small data sets.
  o JMP does not create dotplots, but we’ve encountered them in Tinkerplots earlier this semester.
• Stem and Leaf Plots
  o Again, every data point is plotted when creating a stem and leaf plot.
  o Stem and Leaf plots are normally used for small data sets.
  o JMP will produce a Stem and Leaf plot: click on Analyze → Distribution, put pH in the Y, Columns box, and then click on the little red arrow next to pH in the output. Choose Stem and Leaf from the menu that appears. You should get the following plot.

  Comments:
  o The “leaf” always represents the last digit in the values recorded.
  o The “stem” represents all the other digits in the values recorded.
  o You’ll notice under the stem and leaf plot it says “41|2 represents 4.12.” This is the legend which tells what the stem and leaf units are for that particular graph. In this case the stem is the ones and tenths place and the leaf is the hundredths place.
• Histograms
  o This is a good type of plot when you have a lot of observations.
  o The observations are placed into “bins” and the height of each bin represents the number of observations that fall into that bin.
  o The histogram is one of the default plots produced when you choose Analyze → Distribution in JMP. The histogram of the pH data is given below.

When looking at a dotplot, stem and leaf plot, or histogram of the data, we can describe the shape/distribution of the data using the following terminology.
  o Right Skewed/Positively Skewed
  o Left Skewed/Negatively Skewed
  o Symmetric

Questions:

16. Describe the shape/distribution of the pH data.
17. Does the information given in the histogram agree with what was seen in the dotplot?
18. 
If the data were extremely right skewed, which should be larger: the mean or the median? Explain why this is the case.
19. If the data were extremely left skewed, which should be larger: the mean or the median? Explain why this is the case.
20. If the data were symmetric, which should be larger: the mean or the median? Explain why this is the case.

• Boxplot
  o The boxplot creates a picture of the data using the ______________ as reference points.
  o The “box” portion is made up of _____, _____ and _____.
  o The “whiskers” represent one of two things:
     ▪ The endpoint of the lower whisker is the larger of: _____________ or _______________________
     ▪ The endpoint of the upper whisker is the smaller of: _____________ or _______________________
  o Any measurement beyond the endpoint of either _________________ is classified as a potential ____________________________________ observation.
  o An outlier boxplot is the other default plot produced when you choose Analyze → Distribution in JMP. The boxplot for the pH data is given below.

Example: Again, let’s look at the text messaging data set from the course website.

Questions:

21. Using JMP, create a histogram for the number of text messages sent in a day.
22. Looking at the histogram created in Question 21, describe the shape/distribution for the number of text messages sent in a day.
23. Looking at the boxplot created in JMP, is there any evidence of potential outliers? Explain.

Example: Consider two populations in the same state, where both populations are the same size. Population 1 consists of all students at the state university. Population 2 consists of all residents in a small town. Consider the variable age. Which population would most likely have the larger standard deviation? Explain.

Example: A test is given to 100 students, and the median score is determined. After grading the test, the instructor realizes that the 10 students with the highest scores did exceptionally well.
The instructor decides to award these 10 students a bonus of five additional points. How will the median of the new score distribution change compared to that of the original distribution? Explain.

Example: The following histogram shows the distribution of the ages of male Oscar winners.

24. Which boxplot is graphing the same data as the histogram? Explain.
    a.    b.    c.    d.

Example: Four histograms are presented below. Each histogram displays the quiz scores on a scale of 0 to 10 for one of four different STAT 110 classes.

25. Which of the classes would you expect to have the smallest standard deviation? Explain.
26. Which of the classes would you expect to have the largest standard deviation? Explain.

Measuring the Position of an Observation

There are two commonly used methods for determining an observation’s position relative to all other measurements in the data set.

• Z-score: This measures how many standard deviations an observation is away from the mean. Sometimes it is called the ________________________ value.
  $z\text{-score} = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

Example: From 1947 to 1971 DDT was manufactured in a plant located on Indian Creek, which flowed into the Tennessee River 321 miles from the mouth of the river. In 1972 the EPA banned the use of DDT in the United States. In the late 1970s widespread DDT contamination was discovered at the plant site and in nearby waterways. The data from a study conducted by the U.S. Army Corps of Engineers in the summer of 1980 can be found in the file Catfish.jmp on the course website.

The variables in the data set are given below:

• Fish ID – an identification number for each fish (1 – 44)
• Location – the location on the river from which the fish was sampled.
  o FCM5 = Flint Creek 5 miles from the mouth
  o LCM3 = Limestone Creek 3 miles from the mouth
  o SCM1 = Spring Creek 1 mile from the mouth
  o In general: TRM### = Tennessee River ### miles from the mouth
• Distance from mouth – approximate distance of the sample location from the mouth of the Tennessee River
• Species – fish species (catfish, smallmouth buffalo, largemouth bass)
• Length – length of the sampled fish (in cm)
• Weight – the weight of the sampled fish (in g)
• DDT – the concentration of DDT found in a fillet of fish (in ppm)

A portion of the data set is given below.

Example: To obtain z-scores for the measurements of the variables Length, Weight and DDT, select Save → Standardized from the red drop-down arrow next to the variable name. You should then see the following output in the data table.

Question:

27. Using the first observation, show how the z-score for length was calculated.

Interpretation of z-scores

• The standardized values transform the data so that the data is placed on the standardized scale. The standardized scale has a mean of _____ and a standard deviation of _____.
• The smallest value in the data set will always have the smallest z-score. Likewise, the largest value in the data set will have the largest z-score.
• If a z-score is _______________, then the data point is that many standard deviations below the mean.
• If a z-score is _______________, then the data point is that many standard deviations above the mean.
• If the z-score is ______________, then the data point is the same as the mean.
• If the standard deviation is ________, then the z-score is NOT defined and thus cannot be computed.

The following graphic compares the data on its original scale to the data on the standardized scale.

Questions:

28. What changes between the two graphs?
29. Why do you think z-scores are so important?

Example: Which is more extreme…a catfish 44.5 cm long or a smallmouth buffalo 43.5 cm long?
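For readers who want to check a z-score by hand outside JMP, the calculation can be sketched in Python as follows. This is a minimal sketch, not the JMP procedure: the function names are hypothetical, the `lengths` values are made up for illustration (they are not taken from Catfish.jmp), and the standard deviation is assumed to be the sample standard deviation with the $n - 1$ divisor defined earlier in this handout.

```python
import math


def sample_mean(values):
    """Arithmetic mean of the observations."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation, using the n - 1 divisor."""
    mean = sample_mean(values)
    squared_distances = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_distances / (len(values) - 1))


def z_scores(values):
    """z-score = (observation - mean) / standard deviation, for each value."""
    mean = sample_mean(values)
    sd = sample_std(values)
    return [(x - mean) / sd for x in values]


# Hypothetical fish lengths in cm, made up for illustration only.
lengths = [42.5, 44.0, 41.5, 39.0, 50.5, 52.0, 40.5]

for length, z in zip(lengths, z_scores(lengths)):
    print(f"length = {length:5.1f} cm, z-score = {z:6.2f}")
```

To compare observations from different groups, such as two fish of different species, one reasonable approach is to compute each z-score from the mean and standard deviation of that observation's own group and then compare the two z-scores.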
Identification of Outliers

There are two basic methods for identifying outliers: ________________ and ________________

• Boxplots: As we have already seen, these are commonly used to identify outliers. Recall that any measurement beyond the endpoint of either whisker is classified as a potential outlier (extreme observation).
• Z-scores: Z-scores are also used to identify outliers. Any data value whose z-score is below -2 or above 2 is considered to be a potential outlier. Any data value whose z-score is below -3 or above 3 is considered an outlier and warrants further investigation.

Rules for Data Concentration

Once you have estimated the mean and standard deviation for a set of measurements, you can utilize a few rules to make statements about where the data is concentrated.

• Empirical Rule: If the distribution of the data is _________________________ and symmetric, then the Empirical Rule applies. This rule says that APPROXIMATELY…
  o ________ of the values fall within one standard deviation of the mean.
  o ________ of the values fall within two standard deviations of the mean.
  o ________ of the values fall within three standard deviations of the mean.
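The z-score cutoffs of 2 and 3 described above, and the idea of counting how much of the data falls within a given number of standard deviations of the mean, can be sketched in Python. This is an illustrative sketch under stated assumptions: the function names `classify_by_z` and `share_within` are hypothetical, the `data` values are made up, and the standard deviation is the sample standard deviation as computed by `statistics.stdev`.

```python
import statistics


def classify_by_z(values):
    """Label each value using the z-score cutoffs of 2 and 3 from the text."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    results = []
    for x in values:
        z = (x - mean) / sd
        if abs(z) > 3:
            label = "outlier"
        elif abs(z) > 2:
            label = "potential outlier"
        else:
            label = "not flagged"
        results.append((x, z, label))
    return results


def share_within(values, k):
    """Proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / len(values)


# Hypothetical measurements, made up for illustration only.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 11, 30]

for x, z, label in classify_by_z(data):
    print(f"value = {x:3d}, z-score = {z:6.2f}, {label}")

for k in (1, 2, 3):
    print(f"share within {k} standard deviation(s): {share_within(data, k):.3f}")
```

The shares printed for a small, skewed data set like this one need not match the Empirical Rule, which only applies when the shape condition stated in the rule holds.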
