Methods for a Single Numeric Variable – Descriptive Statistics

So far this semester we’ve been concentrating on categorical variables/data. We are now going to discuss how to summarize and describe numeric data, as well as inferential procedures.

Measures of Center

We can use the following methods to describe the center of the distribution of a given data set.

o Mean: The arithmetic mean of the observations.

  Sample mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

o Median: The middle number in a data set after the numbers have been arranged in ascending (or descending) order. This is the value which cuts off the 50th percentile of the observations, i.e. 50% of the data values lie above the median and 50% of the data values lie below the median.

Example: Two researchers measured the pH (a scale on which a value of 7 is neutral and values below 7 are acidic) of water collected from rain and snow over a 6-month period in Allegheny County, PA. The data can be found on the course website in the file pH.jmp. We can use JMP to find the mean and median of the data.

1. Click on Analyze > Distribution. Then place pH in the Y, Columns box as shown below.
2. You should then get the following output:

The mean and median are circled in the above output.

In addition to the mean and median, quartiles (or percentiles or quantiles) give additional information regarding the distribution of the data.

Q1: The first quartile, which represents the _______ percentile. This is also the median of the lower half of the data.
Q2: The second quartile, which is the _______________.
Q3: The third quartile, which represents the _______ percentile. This is also the median of the upper half of the data.

Questions:

1. Referring to the JMP output above, identify the values for Q1 and Q3.
   Q1 = ________   Q3 = ________
   Note: There are a number of other percentiles listed in the JMP output.
2. What values do the 0th and 100th percentiles represent?
   0th percentile = __________   100th percentile = __________
3. Together the ________, ________, ________, ________, and ________ form what’s called the Five Number Summary. The Five Number Summary provides a numeric “picture” of the distribution of the data.
4. What percent of the observations should fall between the 25th and 75th percentiles?
5. What about the 2.5th and 97.5th percentiles?
6. How about the 0.5th and 99.5th percentiles?

Measures of Variability or Spread

Consider the following data sets.

[Figure: Data sets A, B, and C]

Questions:

7. What is the mean for each data set? The median?

            Data set A    Data set B    Data set C
   Mean     __________    __________    __________
   Median   __________    __________    __________

8. Is a measure of center enough to describe these data sets? If not, what else do you think should be used?

There are several measures of variability or spread of a data set.

Range: The difference between the __________________ and __________________ measurements in the data set.

  Range = ___________________________________

Interquartile Range (IQR): The difference between the _________ and __________ quartiles.

  IQR = __________________

Average Distance from the Mean: To summarize the variability in a set of measurements, we may want to use every observation in the data set to calculate the “average distance from the mean.”

  Average Distance from Mean = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$

Calculate the average distance from the mean for Data set B from above.

   Observation    Sample Mean    Distance
   0              20             ________
   10             20             ________
   20             20             ________
   30             20             ________
   40             20             ________
   Sum of distances              ________
   Average distance from mean    ________

Questions:

9. What is the problem with using this method?
10. It can be shown using a little algebra that we will always get zero for an answer. Do you have any ideas as to how to overcome this problem?

Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using absolute differences. Compute the MAD for Data set B from above.

   Observation    Sample Mean    Absolute Distance
   0              20             ________
   10             20             ________
   20             20             ________
   30             20             ________
   40             20             ________
   Sum of distances              ________
   MAD                           ________

Although this gives a valid measure of variability in a data set, its statistical properties are difficult to work with.
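To check the hand calculations for Data set B, we can let a short Python script do the arithmetic. This is only a sketch for checking work and is not part of the JMP workflow. The variable names are our own, and the only data used are the five observations of Data set B listed above.

```python
# Data set B from the handout: five observations.
data_b = [0, 10, 20, 30, 40]

n = len(data_b)
mean = sum(data_b) / n

# Signed distances from the mean: values below the mean give negative
# distances and values above the mean give positive ones.
deviations = [x - mean for x in data_b]
average_distance = sum(deviations) / n

# Taking absolute values removes the signs before averaging (the MAD).
absolute_deviations = [abs(d) for d in deviations]
mad = sum(absolute_deviations) / n

print("mean:", mean)
print("sum of distances:", sum(deviations))
print("average distance from mean:", average_distance)
print("sum of absolute distances:", sum(absolute_deviations))
print("MAD:", mad)
```

Running the script and comparing its output with your completed tables is a quick way to catch arithmetic slips.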
Traditionally the ____________________ and __________________ are used instead of the MAD.

Variance: The average _______________ distance from the mean.

  Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Compute the sample variance for Data set B.

   Observation    Sample Mean    Squared Distance
   0              20             ________
   10             20             ________
   20             20             ________
   30             20             ________
   40             20             ________
   Sum of distances              ________
   Sample variance               ________

Standard Deviation: The _____________ square root of the variance.

  Sample standard deviation = $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Compute the sample standard deviation for Data set B.

Example: Looking again at the pH data, compute the range, IQR, sample variance, and sample standard deviation using the JMP output.

   Range                            ________
   IQR                              ________
   Sample standard deviation (s)    ________
   Sample variance ($s^2$)          ________

Example: Download the file messages.jmp from the course website. This data set contains the number of text messages an individual sends in a day, collected from the student data survey you may have completed at the beginning of the semester. Answer the following questions regarding the data set.

Questions:

11. Compute the average number of text messages sent in a day.
12. Find the Five Number Summary for the number of text messages sent in a day.
13. Compute the range for the number of text messages sent in a day.
14. Between what two values does the middle 50% of text messages sent in a day lie?
15. Compute the sample standard deviation and sample variance for the number of text messages sent in a day.

We’ve been looking at numerical summaries used to describe a single numeric variable. We will now look at the various methods to graphically summarize these types of variables.

Description of Shape

We can use many different types of graphical summaries to describe the shape (or distribution) of the observed data.

Comments: When plotting numeric data, the __________________ axis is a number line of values (i.e. CONTINUOUS!). The _________________ axis usually represents counts or sometimes the relative frequency of observations which have the same value.
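The counts and relative frequencies mentioned in the comments above can be tallied by hand or with a few lines of code. The Python sketch below is only an illustration: the values are made up (they are not the pH.jmp data), and the names `values`, `counts`, and `relative_frequency` are our own.

```python
from collections import Counter

# Hypothetical measurements, invented for illustration only.
values = [4.1, 4.3, 4.3, 4.5, 4.5, 4.5, 4.7, 4.7, 4.9, 5.1]

# Count: how many observations share each value.
counts = Counter(values)

# Relative frequency: the count divided by the total number of observations.
n = len(values)
relative_frequency = {value: count / n for value, count in counts.items()}

# List each distinct value with its count and relative frequency.
for value in sorted(counts):
    print(value, counts[value], relative_frequency[value])
```

Counts and relative frequencies always produce plots with the same shape; only the scale of the axis that carries them changes.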
We will again use the file pH.jmp to discuss the various graphical techniques for describing the shape (or distribution) of the observed data.

Dotplot

o _________ data point is plotted when creating a dotplot.
o Dotplots are normally used for small data sets.
o JMP does not create dotplots, but we’ve encountered them in Tinkerplots earlier this semester.

Stem and Leaf Plots

o Again, every data point is plotted when creating a stem and leaf plot.
o Stem and leaf plots are normally used for small data sets.
o To produce a stem and leaf plot in JMP, click on Analyze > Distribution, put pH in the Y, Columns box, and then click on the little red arrow next to pH in the output. Choose Stem and Leaf from the menu that appears. You should get the following plot.

Comments:

o The “leaf” always represents the last digit in the values recorded.
o The “stem” represents all the other digits in the values recorded.
o You’ll notice that under the stem and leaf plot it says “41|2 represents 4.12.” This is the legend, which tells what the stem and leaf units are for that particular graph. In this case the stem is the ones and tenths places and the leaf is the hundredths place.

Histograms

o This is a good type of plot when you have a lot of observations.
o The observations are placed into “bins,” and the height of each bin represents the number of observations that fall into that bin.
o The histogram is one of the default plots produced when you choose Analyze > Distribution in JMP. The histogram of the pH data is given below.

When looking at a dotplot, stem and leaf plot, or histogram of the data, we can describe the shape/distribution of the data using the following terminology.

o Right Skewed/Positively Skewed
o Left Skewed/Negatively Skewed
o Symmetric

Questions:

16. Describe the shape/distribution of the pH data.
17. Does the information given in the histogram agree with what was seen in the dotplot?
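The stem and leaf construction described earlier in this section (the leaf is the last recorded digit, the stem is everything else) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a reproduction of JMP’s display: the function name `stem_and_leaf` and the values are hypothetical, the values are assumed to be non-negative and recorded to two decimal places, and stems with no observations are simply left out.

```python
from collections import defaultdict

def stem_and_leaf(values, decimals=2):
    """Group values by stem (all digits but the last) and leaf (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        # Shift the decimal point so the last recorded digit is the ones digit.
        scaled = round(value * 10 ** decimals)
        stem, leaf = divmod(scaled, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical values recorded to two decimal places (not the pH.jmp data).
values = [4.12, 4.18, 4.26, 4.31, 4.31, 4.57, 4.12]

plot = stem_and_leaf(values)
for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem}|{leaves}")
```

With two recorded decimals, a printed row such as 41|2 is read the same way as the JMP legend: the stem holds the ones and tenths places and the leaf holds the hundredths place.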
Boxplot

o The boxplot creates a picture of the data using the ______________ as reference points.
o The “box” portion is made up of _____, _____ and _____.
o The “whiskers” represent one of two things:
    The endpoint of the lower whisker is the larger of: _____________ or _______________________
    The endpoint of the upper whisker is the smaller of: _____________ or _______________________
o Any measure beyond the endpoint of either _________________ is classified as a potential ____________________________________ observation.
o An outlier boxplot is the other default plot produced when you choose Analyze > Distribution in JMP. The boxplot for the pH data is given below.

Numerical Measures for Shape

There are two numerical summaries for shape: _______________ and _______________.

Skewness

o A data distribution is said to be symmetric if it has the same shape on both sides of the center of the distribution. Skewness is a measure of __________________.

   Shape           Picture     Skewness Measure in JMP
   Symmetric       [figure]    Zero
   Right Skewed    [figure]    Greater than zero
   Left Skewed     [figure]    Less than zero

   Symmetric: The most famous symmetric distribution is the normal. Others?

Kurtosis

o This is used to measure the amount of _________________ in the distribution of the data relative to the normal distribution.

   Shape                                                       Picture     Kurtosis Measure in JMP
   Normal                                                      [figure]    Zero
   Positive Kurtosis (taller or skinnier than normal shape)    [figure]    Greater than zero
   Negative Kurtosis                                           [figure]    Less than zero

JMP will display both the skewness and kurtosis values if you click on the red drop-down arrow next to pH, choose Display Options > Customize Summary Statistics, and check Skewness and Kurtosis. You should then get the following output.

Questions:

18. How did we describe the shape/distribution of the pH data in Question 16?
19. Does the numerical measure for skewness agree with this? Explain.
20. If the data were extremely right skewed, which should be larger: the mean or the median? Explain why this is the case.
21. If the data were extremely left skewed, which should be larger: the mean or the median? Explain why this is the case.
22. If the data were symmetric, which should be larger: the mean or the median? Explain why this is the case.

Example: Again, let’s look at the text messaging data set from the course website.

Questions:

23. Using JMP, create a histogram for the number of text messages sent in a day.
24. Looking at the histogram created in Question 23, describe the shape/distribution for the number of text messages sent in a day.
25. Looking at the boxplot created in JMP, is there any evidence of potential outliers? Explain.
26. Give the values for skewness and kurtosis.
27. Does the value for skewness agree with your answer to Question 24?

Example: Consider two populations in the same state, where both populations are the same size. Population 1 consists of all students at the state university. Population 2 consists of all residents in a small town. Consider the variable age. Which population would most likely have the larger standard deviation? Explain.

Example: A test is given to 100 students, and the median score is determined. After grading the test, the instructor realizes that the 10 students with the highest scores did exceptionally well. The instructor decides to award these 10 students a bonus of five additional points. How will the median of the new score distribution change compared to that of the original distribution? Explain.

Example: The following histogram shows the distribution of the ages of male Oscar winners.

28. Which boxplot is graphing the same data as the histogram? Explain.

[Figure: boxplots a, b, c, and d]

Example: Four histograms are presented below. Each histogram displays the quiz scores on a scale of 0 to 10 for one of four different STAT 110 classes.

29. Which of the classes would you expect to have the smallest standard deviation? Explain.
30. Which of the classes would you expect to have the largest standard deviation? Explain.
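When comparing standard deviations by eye, it can help to see the sample variance and standard deviation formulas from earlier in this handout applied to two small sets of scores. The Python sketch below is only an illustration: the two classes are invented (they are not the four STAT 110 histograms), and the function names `sample_variance` and `sample_std_dev` are our own.

```python
import math

def sample_variance(data):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_std_dev(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

# Hypothetical quiz scores on a 0 to 10 scale, invented for illustration.
clustered_class = [4, 5, 5, 5, 5, 5, 5, 6]   # scores bunched near the center
spread_class = [0, 0, 1, 5, 5, 9, 10, 10]    # scores piled toward both ends

for name, scores in [("clustered", clustered_class), ("spread", spread_class)]:
    print(name,
          "mean:", sum(scores) / len(scores),
          "s^2:", round(sample_variance(scores), 2),
          "s:", round(sample_std_dev(scores), 2))
```

Both made-up classes have the same mean, so any difference in the printed standard deviations comes only from how far the scores sit from that mean.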