Chapter 1: Looking at Data--Distributions
Section 1.1: Introduction, Displaying Distributions with Graphs
Section 1.2: Describing Distributions with Numbers

Learning goals for this chapter:
  Identify categorical and quantitative variables.
  Interpret, create (by hand and with SPSS), and know when to use: bar graphs, pie charts, stemplots (standard, back-to-back, split), histograms, and boxplots (regular, modified, side-by-side).
  Describe the shape, center, and spread of data distributions.
  Define, calculate (by hand and with SPSS), and know when to use measures of center (mean vs. median) and spread (range, 5-number summary, IQR, variance, standard deviation).
  Understand what a resistant measure of center and spread is and when this is important.
  Use the 1.5 x IQR rule to look for outliers.
  Draw a Normal curve in correct proportions and identify the mean/median, standard deviation, middle 68%, middle 95%, and middle 99.7%.
  Perform calculations with the empirical rule, both backwards and forwards.
  Understand the need for standardization.

Big picture: what do we learn in this chapter?
  Individuals vs. Variables
  Categorical vs. Quantitative Variables
  Graphs:
    Bar graphs and pie charts (categorical variables)
    Histograms and stemplots (quantitative variables; good for checking for symmetry and skewness)
    Boxplots (quantitative variables; graphical display of the 5-number summary; modified boxplots show outliers)
  Describing distributions:
    Shape (symmetric/skewed, unimodal/bimodal/multimodal)
    Center (mean or median)
    Spread (usually standard deviation/variance or IQR from the 5-number summary)
    Outliers
  If you have a symmetric distribution with no outliers, use the mean and standard deviation.
  If you have a skewed distribution and/or you have outliers, use the 5-number summary instead.

2 components in describing data or information:
  Individuals: objects being described by a set of data (people, households, cars, animals, corn, etc.)
  Variables: characteristics of individuals (height, yield, length, age, eye color, etc.)
    Categorical: places an individual into one of several groups (gender, eye color, college major, hometown, etc.)
    Quantitative: attaches a numerical value to an individual so that adding or averaging the values makes sense (height, weight, age, income, yield, etc.)

Distribution of a variable: describes what values a variable takes and how often it takes those values.

If you have more than one variable in your problem, you should look at each variable by itself before you look at relationships between the variables.

Example: Identify whether the following questions would give you categorical or quantitative data.
  a) What letter grade did you get in your Calculus class last semester?
  b) What was your score on the last exam?
  c) Who will you vote for in the next election?
  d) How many votes did George W. Bush get?
  e) How many red M&Ms are in this bag?
  f) Which type of M&Ms has more red ones: peanut or plain?

It's always a good idea to start by displaying variables graphically before you do any other statistical analysis. What kind of graph should you use? That depends on whether you have a categorical or quantitative variable.

Categorical Variables: Bar graphs or pie charts

Messy room example: In a poll of 200 parents of children ages 6 to 12, respondents were asked to name the most disgusting things ever found in their children's rooms.
The results are below (J&C 2005):

  Most disgusting thing                              # of parents   % of parents
  Food-related                                       106            53%
  Animal and insect-related nuisances                22             11%
  Clothing (dirty socks and underwear especially)    22             11%
  Other                                              50             25%

Bar graph (can use either # of parents, as here, or % of parents):
  [Bar graph: Count (0 to 120) by type of disgusting mess (animal, clothing, food, other); cases weighted by # of parents]

Pie chart (needs % of parents):
  [Pie chart: type of disgusting mess; food 53.0%, other 25.0%, animal 11.0%, clothing 11.0%; cases weighted by # of parents]

Quantitative Variables: Stemplots, histograms, and boxplots (discussed a little later)

Example: You investigate the amount of time students spend online (in minutes). You study 28 students, and their times are listed below. Show the distribution of times with a stemplot.

   7   42   72
  20   43   75
  24   44   77
  25   45   78
  25   46   79
  28   47   83
  28   48   87
  30   48   88
  32   50
  35   51

To create a stemplot by hand:
  1. Put the data in order from smallest to largest.
  2. The "stem" will be all digits for a data point except for the last one. Write the stems in a vertical line. (Think of "7" as being "07" so that all the numbers have a digit in the tens place.)
  3. The "leaf" will be the next digit (in this case, the ones place) from each data point. Write the leaves after the appropriate stem, in increasing order.
  4. It is possible to "trim" any digits that you feel may be unnecessary. For example, if our second data point had been 20.3, we would probably choose to ignore the ".3" for the purposes of the stemplot so that we could create a more reasonable stemplot. If we did not ignore this ".3", then our stems would have been 07, 08, 09, 10, 11, 12, 13, ..., 88 with decimal numbers as our leaves. This would show a very uniform stemplot with only one leaf for each stem (all leaves would be 0 except for the 3). This would not be helpful to us at all. It makes much more sense to use the tens place for the stem and the ones place as the leaves in this example.
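The four steps above can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the function name `stemplot` and the variable `online_times` are made up for this example, and the sketch assumes non-negative data with the tens place (and up) as the stem.

```python
def stemplot(values):
    """Build a stemplot for non-negative numbers, one text line per stem."""
    # Step 1: put the data in order; int() trims any decimal part (step 4).
    data = sorted(int(v) for v in values)
    # Step 2: the stem is every digit except the last one.
    stems = range(data[0] // 10, data[-1] // 10 + 1)
    leaves = {stem: "" for stem in stems}
    # Step 3: the leaf is the last digit, written in increasing order.
    for v in data:
        leaves[v // 10] += str(v % 10)
    return [f"{stem} |{leaves[stem]}" for stem in stems]


# The 28 online times (in minutes) from the example above.
online_times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
                42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
                72, 75, 77, 78, 79, 83, 87, 88]

for line in stemplot(online_times):
    print(line)
```

Stems with no data (here 1 and 6) are still printed, because the gaps are part of the shape of the distribution.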
A split stemplot just has more stems. There are several ways to split the stems. Here they are split by fives.

Stemplot:
  0 |7
  1 |
  2 |045588
  3 |025
  4 |23456788
  5 |01
  6 |
  7 |25789
  8 |378

Split stemplot:
  0 |7
  1 |
  1 |
  2 |04
  2 |5588
  3 |02
  3 |5
  4 |234
  4 |56788
  5 |01
  5 |
  6 |
  6 |
  7 |2
  7 |5789
  8 |3
  8 |78

Why do we need split stemplots? Sometimes it is easier to see the shape of the data with more stems. Sometimes a regular stemplot is better. If you're not sure, try it both ways and see if a pattern appears.

Try a stemplot and a split stemplot with this data (use the hundreds place for stems):
  3, 4, 17, 18, 39, 93, 102, 110, 143, 178, 250, 278, 299, 300.1

Histograms

Sorting the quantitative data into bins. How many bins?
  Not too many bins with either 0 or 1 counts.
  Not overly summarized so that you lose all the information.
  Not so detailed that it is no longer a summary.
  [Figure: three example histograms labeled "Too few bins", "OK", and "Too many bins"]

Histograms vs. bar graphs
  Histograms:
    The bars for each interval touch each other.
    Histograms have a continuous, quantitative x-axis, with the x-values in order.
    Quantitative variables
  Bar graphs:
    The bars for each category do not touch each other. There are spaces between the bars.
    Bar graphs can have the categories on the x-axis listed in any order (alphabetical, biggest-to-smallest, etc.)
    Categorical variables

Histograms vs. stemplots
  Histograms:
    Quantitative variables
    Good for big data sets, especially if technology is available.
    Uses a box to represent each data point.
  Stemplots:
    Quantitative variables
    Good for small data sets, convenient for back-of-the-envelope calculations.
    Rarely found in scientific or lay publications.
    Uses a digit to represent each data point.

You've drawn your graph (histogram or stemplot). Now what? Look for overall pattern and any outliers. The pattern is described by shape, center, and spread.

1. Shape:
   o # of peaks (unimodal = 1, bimodal = 2, multimodal > 2)
   o Where the long tail is:
       Symmetric: Median = Mean
       Right skewed (long tail on the right): Median < Mean
       Left skewed (long tail on the left): Median > Mean
   To describe the shape, use a histogram with a smoothed curve highlighting the overall pattern of the distribution (don't get overly detailed).

2. Center: (If the distribution is symmetric, the mean will equal the median, but otherwise these numbers are not the same.)
   a) Mean: arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n = the total # of observations and $x_i$ = an individual observation
   b) Mode: the most common number, biggest peak
   c) Median: M, midpoint of the distribution such that 1/2 the observations are smaller and 1/2 the observations are larger. The median is not as affected by outliers as the mean is; the median is resistant to outliers.
      To find the median:
        i. Order the data from smallest to largest.
        ii. Count the # of observations (n).
        iii. Calculate $\frac{n+1}{2}$ to find the center of the data set.
        iv. If n is odd, M is the data point at the center of the data set.
        v. If n is even, $\frac{n+1}{2}$ falls between 2 data points, called the "middle pair." M = the average of the middle pair.
   Examples of center:
     Find the mean and median of the following 7 numbers in Dataset A: 23 25 32.5 33 67 1 -20
     Find the mean and median of the following 8 numbers in Dataset B: 1 2 4 6 8 9 12 13

3. Spread:
   a) Range = max - min (simplest, not always the most helpful)
   b) Variance: $s^2$, average of the squared deviations of the observations from the mean, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
   c) Standard Deviation: s, square root of the variance, a common way of measuring how far observations are from the mean
      Example of finding the standard deviation by hand: 0, 2, 4
        1. Calculate the mean.
        2. Calculate the variance.
        3. Take the square root of the variance.
   d) pth percentile: value such that p% of the observations fall at or below it
        Median = M = 50th percentile
        First Quartile = Q1 = 25th percentile
        Third Quartile = Q3 = 75th percentile
      How do you find quartiles? Think of them as "mini-medians." Leave the median out, and then find the median of what is left over on the left side (Q1) and what is left over on the right side (Q3).
      Find the 1st and 3rd quartiles of the following 7 numbers in Dataset A:
        -20 (Min)   1   23   25 (M)   32.5   33   67 (Max)
      Find the 1st and 3rd quartiles of the following 8 numbers in Dataset B:
        1 (Min)   2   4   6   8   9   12   13 (Max), with M = 7
   e) 5-Number Summary: Min, Q1, M, Q3, Max
   f) Interquartile Range (IQR) = Q3 - Q1
      Call an observation a suspected outlier if it is > Q3 + 1.5 x IQR or < Q1 - 1.5 x IQR.
   g) Boxplots: graph of the 5-number summary
      Modified boxplots have lines that extend from the box out to the smallest and largest observations which are NOT outliers. Dots mark any outliers. (We will always ask for the modified boxplot, but if there are no outliers, the modified and regular boxplots look exactly the same.)

      [Figure: boxplot for Dataset A] 5-number summary: -20, 1, 25, 33, 67
      Since there were no outliers in this dataset, a regular boxplot and a modified boxplot look exactly the same for this data.

For the online time example (with 2 additional data points added in), list the 5-number summary, find any outliers present, and show a boxplot and modified boxplot.

   7   42    72
  20   43    75
  24   44    77
  25   45    78
  25   46    79
  28   47    83
  28   48    87
  30   48    88
  32   50   135
  35   51   151

How do you know which method is best for determining center and spread?
  5-Number Summary: better for skewed distributions or distributions with outliers.
  Mean and Standard Deviation: good for reasonably symmetric distributions free of outliers.
  Always start with a graph!
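The by-hand "mini-median" quartile method and the 1.5 x IQR rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names (`median_of_sorted`, `five_number_summary`, `suspected_outliers`) are invented for this example, and the quartile rule implemented is the one from these notes, which is not necessarily what SPSS or other software reports.

```python
def median_of_sorted(sorted_vals):
    """Median of an already sorted list: middle value, or mean of the middle pair."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the 'leave the median out' rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                # left of the median
    upper = data[half + 1:] if n % 2 == 1 else data[half:]  # right of the median
    return (data[0], median_of_sorted(lower), median_of_sorted(data),
            median_of_sorted(upper), data[-1])


def suspected_outliers(values):
    """Observations above Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


dataset_a = [23, 25, 32.5, 33, 67, 1, -20]
# The 28 online times plus the two added points, 135 and 151.
online_times_30 = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
                   42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
                   72, 75, 77, 78, 79, 83, 87, 88, 135, 151]

print(five_number_summary(dataset_a))        # matches -20, 1, 25, 33, 67 above
print(five_number_summary(online_times_30))
print(suspected_outliers(online_times_30))
```

For the 30 online times, IQR = 77 - 30 = 47, so the upper fence is 77 + 70.5 = 147.5; 151 is flagged as a suspected outlier while 135 is not.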
In the internet time example, here is how the mean/standard deviation and 5-number summary are affected by the outlier:

                                      Mean    Standard Deviation   5-number summary
  With outlier (151)                  54.77   32.647               7, 30, 46.5, 77, 151
  With outlier removed from dataset   51.45   27.600               7, 29, 46, 76, 135

"The Median vs. the Mean in the Age of Average" by Mike Pesca on NPR's Day-to-Day 7/19/06: http://www.npr.org/templates/story/story.php?storyId=5567890

Do you always have to do all of this by hand? NO! Statistical software packages like SPSS can make life much easier for you, but it's a good idea to know how to do these by hand so you can make sense of your output. Also, on the exam, you won't have access to a computer.

Read over your SPSS manual and get comfortable with using SPSS. You will have a chance to practice on the HW for this week, and you will work on it in lab on Friday.

Enter your data, then Analyze --> Descriptive Statistics --> Explore. Follow the instructions on p. 48 of the SPSS manual. The output from SPSS for the internet time problem looks like:

Descriptives: Time spent on the web
                                                   Statistic   Std. Error
  Mean                                             54.77       5.961
  95% Confidence Interval for Mean   Lower Bound   42.58
                                     Upper Bound   66.96
  5% Trimmed Mean                                  52.13
  Median                                           46.50
  Variance                                         1065.840
  Std. Deviation                                   32.647
  Minimum                                          7
  Maximum                                          151
  Range                                            144
  Interquartile Range                              48
  Skewness                                         1.314       .427
  Kurtosis                                         1.977       .833

Stem-and-Leaf Plot
  Frequency    Stem &  Leaf
      1.00        0 .  0
      9.00        0 .  222222333
     10.00        0 .  4444444455
      5.00        0 .  77777
      3.00        0 .  888
       .00        1 .
      1.00        1 .  3
      1.00 Extremes    (>=151)
  Stem width: 100
  Each leaf:  1 case(s)

[Histogram: Frequency vs. Time spent on the web; Mean = 54.77, Std. Dev. = 32.647, N = 30]

Notice on the boxplot, it is easy to identify the potential outlier. This would be your indication that the 5-number summary would be the best way to describe your data. (You could also try calculating the mean and standard deviation without the outlier for comparison.)
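That comparison can be sketched in Python using the mean and sample variance formulas from earlier in these notes. The helper names (`mean`, `sample_variance`, `sample_sd`) and the variable names are chosen only for this illustration; the sketch assumes the 30 online times listed earlier, with 151 as the outlier.

```python
import math


def mean(values):
    """Arithmetic average: sum of the observations divided by n."""
    return sum(values) / len(values)


def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def sample_sd(values):
    """s: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# The small by-hand example: 0, 2, 4 (mean 2, variance 4, standard deviation 2).
print(mean([0, 2, 4]), sample_variance([0, 2, 4]), sample_sd([0, 2, 4]))

with_outlier = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
                42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
                72, 75, 77, 78, 79, 83, 87, 88, 135, 151]
without_outlier = [x for x in with_outlier if x != 151]

for label, data in [("with 151", with_outlier), ("without 151", without_outlier)]:
    print(f"{label}: mean = {mean(data):.2f}, sd = {sample_sd(data):.3f}")
```

Removing a single observation out of 30 moves the mean by more than 3 minutes and the standard deviation by about 5, which is why neither measure is called resistant.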
SPSS can also give you the Quartiles (listed under "Percentiles"), but these are not necessarily the same answers as what you would get by hand. The "weighted average" and "Tukey's Hinges" are not the same method we use. For this class, whenever we ask you to calculate the Quartiles, we want you to do them by hand.

What if you want to compare the results from two or more different groups? Use side-by-side boxplots or back-to-back stemplots for your graphs.

[Figure: back-to-back stemplot with Female leaves on the left and Male leaves on the right, sharing stems 2 through 9]

Preview of Section 1.3 (from Section 1.3)

A z-score tells us how many standard deviations away from the mean an observation is:

  $z = \frac{x - \mu}{\sigma}$

This is also called getting a standardized value.

Why is standardization useful? For comparing apples to oranges.

Example: (p. 88, Problem 1.99) Jacob scores 16 on the ACT. Emily scores 670 on the SAT. Assuming that both tests measure scholastic aptitude, who has the higher score? The SAT scores for 1.4 million students in a recent graduating class were roughly Normal with a mean of 1026 and standard deviation of 209. The ACT scores for more than 1 million students in the same class were roughly Normal with a mean of 20.8 and standard deviation of 4.8.

How else can we use standardization? If the distribution of observations has a bell shape, then these standardized values have some special properties. One of these is the 68-95-99.7% Empirical Rule.
  Approximately 68% of the observations fall within $1\sigma$ of the mean (between $\mu - 1\sigma$ and $\mu + 1\sigma$).
  Approximately 95% of the observations fall within $2\sigma$ of the mean (between $\mu - 2\sigma$ and $\mu + 2\sigma$).
  Approximately 99.7% of the observations fall within $3\sigma$ of the mean (between $\mu - 3\sigma$ and $\mu + 3\sigma$).

  $P(\mu - 1\sigma < X < \mu + 1\sigma) = 0.68$
  $P(\mu - 2\sigma < X < \mu + 2\sigma) = 0.95$
  $P(\mu - 3\sigma < X < \mu + 3\sigma) = 0.997$

The curve's axis is labeled in standard deviations away from the mean (z-score), so a z-score of -2 could also be written as $\mu - 2\sigma$, for example. The mean and the median of a bell-shaped curve are in the middle.
This is shown with a 0 because the mean is 0 standard deviations away from itself.

The most famous bell-shaped distribution is the Normal distribution. We will spend several lectures talking about it for Section 1.3, and it will be important to everything we do for the rest of the semester.

Example: Checking account balances are approximately Normally distributed with a mean of $1325 and a standard deviation of $25.
  a) Between what numbers do 68% of the balances fall?
  b) Above what number do 2.5% of the balances lie?
  c) Approximately what percent of balances are between 1250 and 1400?
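As a rough check on the two standardization examples, here is a short Python sketch. The function names `z_score` and `empirical_interval` are hypothetical helpers written for this illustration, and the sketch assumes the means and standard deviations quoted in the examples (ACT 20.8 and 4.8, SAT 1026 and 209, balances $1325 and $25).

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def empirical_interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd (k = 1, 2, 3 for 68/95/99.7%)."""
    return (mean - k * sd, mean + k * sd)


# Apples to oranges: compare an ACT score with an SAT score.
jacob = z_score(16, 20.8, 4.8)     # ACT
emily = z_score(670, 1026, 209)    # SAT
print(f"Jacob z = {jacob:.2f}, Emily z = {emily:.2f}")

# Checking account balances: mean $1325, standard deviation $25.
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    low, high = empirical_interval(1325, 25, k)
    print(f"about {pct}% of balances fall between {low} and {high}")
```

Both z-scores are negative, so both students scored below average, but Jacob's is closer to 0 and is therefore the higher score in standardized terms. For part b), the 95% interval leaves 5% outside it, split evenly by symmetry, so about 2.5% of balances lie above its upper end.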