Exploratory Data Analysis
PS372 Spring 2010

The four features of distributions
Central location – where are most of the observations?
Spread – how far apart are the observations?
Shape – symmetric or skewed?
Outliers – are any observations very far from the rest?

What type of data do you have?
Nominal – observations are in categories. Examples are gender (male, female) or eye color (blue, green, brown, other).
Ordinal – observations can be ranked (i.e. “greater than” or “less than” makes sense). Examples are education level (less than high school, high school, bachelor’s, graduate degree) or agreement with a survey question (strongly disagree, disagree, ambivalent, agree, strongly agree).

Scales, continued
Interval – an observation is on an interval scale if the difference between two numbers has meaning. In measuring temperature, 95 degrees is 5 degrees higher than 90 degrees, 30 degrees is 5 degrees higher than 25 degrees, and so on. This is not true of all data. For example, is the difference between “agree” and “strongly agree” the same as the difference between “ambivalent” and “agree”?

Scales, continued
Ratio – the strongest form of scale. It indicates that ratios (division) of numbers have meaning. If your income is 20,000 then you have twice as much income as someone who makes 10,000. Temperature in Fahrenheit or Celsius is NOT ratio scaled, since 10 degrees is not “twice as hot” as 5 degrees. In a ratio scale, 0 has meaning as “nothing”. In those temperature scales, 0 is arbitrary.

Back to the four features
Central location – where are most of the observations?
When the observations are in categories, the most relevant “statistics” are the counts and/or relative frequencies in each category. For example, “50.5% of live births are male, while 49.5% are female”. Alternatively, “in our town, we have 934 men and 982 women”.
It doesn’t make sense to talk about “the average person’s gender”, since you really can’t be somewhere in the middle.
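The counts and percentages described for categorical data can be tabulated in a few lines. The following is a minimal Python sketch, not part of the course materials; the function name `category_summary` is hypothetical, and the data simply reuse the town example (934 men, 982 women).

```python
from collections import Counter

def category_summary(observations):
    """Return {category: (count, percent)} for categorical observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Town example from the slides: 934 men and 982 women.
town = ["male"] * 934 + ["female"] * 982
summary = category_summary(town)

for category, (count, percent) in summary.items():
    print(f"{category}: {count} ({percent:.1f}%)")
```

Note that the summary reports a count and a share for each category; it never tries to average the categories themselves.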
Mode
One statistic mentioned often for categorical data (ordinal or nominal) is the mode, which is the category with the most observations.
The mode is most meaningful when one of the categories has most of the observations, as in “most faculty at UK have doctoral degrees”.
If the data are spread among many categories, knowing the mode doesn’t provide a full picture. For example, “the largest department in Arts and Sciences at UK is Psychology” does not say anything about the majority of faculty.
Summary – the mode often isn’t that useful.

Central Location for Interval/Ratio
For interval/ratio data, the most common measures of central location are the mean and median.
The mean is defined as the arithmetic average of the observations. You find it by adding them up and dividing by the number of observations. If your observations are (1,5,12), the mean is (1+5+12)/3 = 6.

Mean/Median continued
The median is the “middle” observation of the SORTED data. If your observations are (1,5,12), the median is 5. If your observations are (4,10,2,8,9), the sorted data are (2,4,8,9,10) and the median is 8.
If there is an even number of observations, average the two middle values. So if the data are (6,10,4,3), the sorted data are (3,4,6,10), the middle values are 4 and 6, and (4+6)/2 = 5. The median is 5.

Differences between the mean and median
The median is robust, which means that outliers have little or no effect on it. The mean is not.
Suppose we have data (1,4,6,10,12). The mean is 33/5 = 6.6 while the median is 6.
Suppose we change the 12 to 14000. The median is still 6, but the mean changes to 14021/5 = 2804.2.
Note also that the median is still close to most of the data, but the mean is nowhere close to any data point.

Spread
For ordinal/nominal data, we do NOT cover a measure of spread in this class.
Measures of spread for ordinal/nominal data do exist. Essentially, these measures indicate whether the data are spread evenly across all the categories or whether one or a few categories contain almost all the data. The notion is called entropy.
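For the curious, here is a minimal Python sketch of one common version of this idea, Shannon entropy measured in bits. The choice of Shannon entropy, the function name `shannon_entropy`, and the category counts are all assumptions made for illustration; the slides do not specify a formula.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a list of category counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in proportions)

# Made-up counts for four categories.
even = shannon_entropy([25, 25, 25, 25])    # spread evenly across categories
lopsided = shannon_entropy([97, 1, 1, 1])   # one category holds almost everything

print(even, lopsided)
```

Evenly spread data give the larger value, and data concentrated in one category give a value near zero.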
Not required in our class, but look it up if you need it.

Spread for interval/ratio
Some common measures of spread for interval/ratio data are the range, the interquartile range, and the standard deviation.
The range is simply the distance between the smallest and largest observations. It is obviously not robust to outliers, and it is seldom used except when the spread is very small (e.g. if all the scores on an exam happened to be between 76 and 78, which doesn’t happen very often).

Interquartile range
First, we have to define the quartiles. Recall that when we compute the median, we are dividing the data in half. The quartiles divide each of the halves in half again (this divides the data into four parts, hence the term quartile).
To find the quartiles, first sort the data as if you were finding the median.

Quartiles continued
If n is even, divide the data in half, thus creating a first half and a second half.
If n is odd, remove the median, and then divide the data in half to produce a first half and a second half.
The first quartile, Q1, is the median of the first half. The third quartile, Q3, is the median of the second half. (Q2 is the median.)

Example of computing quartiles
Suppose our sorted data were 12, 14, 23, 36, 40, 42, 44, 61, and 78.
There are n=9 numbers, so find the median M=40 and remove it. The first half is (12, 14, 23, 36) and the second half is (42, 44, 61, 78).
The median of the first half is Q1 = (14+23)/2 = 18.5, while the median of the second half is Q3 = (44+61)/2 = 52.5.

Interquartile range
The interquartile range is Q3 - Q1. It is not sensitive to outliers.
We used the data 12, 14, 23, 36, 40, 42, 44, 61, and 78, for which the IQR is 52.5 - 18.5 = 34. If we changed the 78 to 100,000, the interquartile range (IQR) would not change.

Standard deviation
The standard deviation is based on measuring the average squared distance from the mean. It is defined as

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$

where $\bar{X}$ is the mean of the n observations.

Standard deviation continued
The standard deviation is sensitive to outliers.
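The quartile rule and the outlier comparison above can be checked with a short Python sketch. This is an illustration under stated assumptions rather than course code: `quartiles` is a hypothetical helper that follows the slides' rule (drop the median when n is odd) and needs at least two observations. Statistical software often uses slightly different quartile conventions, so its results may differ from these. `statistics.stdev` is the standard-library sample standard deviation, which uses the n - 1 denominator.

```python
import statistics

def quartiles(data):
    """Q1 and Q3 by the slides' rule: split the sorted data in half
    (dropping the median first when n is odd), then take each half's median."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[-half:]   # when n is odd, this skips the middle value
    return statistics.median(lower), statistics.median(upper)

data = [12, 14, 23, 36, 40, 42, 44, 61, 78]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)   # 18.5 52.5 34.0

# Replace the largest value with an extreme outlier, as in the slides.
with_outlier = [12, 14, 23, 36, 40, 42, 44, 61, 100_000]
q1_out, q3_out = quartiles(with_outlier)
print(q3_out - q1_out)                   # IQR is unchanged: 34.0

print(statistics.stdev(data))            # sample standard deviation
print(statistics.stdev(with_outlier))    # far larger once the outlier is present
```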
If one of the observations is very large, then the standard deviation will be large as well.
Unless there are strong outliers, the standard deviation is the most commonly used measure of spread. This is because the standard deviation is directly related to normal distributions (bell curves), which we will study later.

Interlude – review of central location and spread
For nominal/ordinal data, we simply report the percentages in each category.
For interval/ratio data, central location is usually measured by the mean (not robust) or the median (robust).
For interval/ratio data, spread is usually measured by the standard deviation (not robust) or the interquartile range (robust).
The mode (central location) and the range (spread) are rarely used for inference.

Shape
Look at the “tails”.
If the tails are of equal length, then the distribution is symmetric.
If the tail for lower values is longer, the distribution is left skewed.
If the tail for higher values is longer, the distribution is right skewed.
“Symmetric” gets the benefit of the doubt in describing a distribution. “Roughly symmetric” is fine. I will not put judgment calls on homework or exams.

[Figure: symmetric data – ideally and practically]
[Figure: right skewed data – ideally and practically]
[Figure: left skewed data – ideally and practically]

Outliers
Recall that outliers are any points that appear separate from the rest. Often this is a judgment call. Saying “mild outlier” is fine; I don’t intend to police judgment calls.
Outliers often occur with skewed data, in the direction of the long tail.

Boxplots
A boxplot is intended to be a SIMPLE plot that lets you quickly see all the features of the distribution.
In PS372 you will NOT be expected to draw a boxplot from scratch, but you will be expected to interpret a boxplot drawn on a computer.

Step 1 for boxplot – The Box
The box extends from Q1 to Q3, with a line for the median. Thus, you can immediately see the median (central location) and the IQR (spread).
Note that the box contains the middle 50% of the data.
[Figure: the box, with Q1, the median, and Q3 labeled]

Step 2 for boxplot – The fences
Construct the “fences”. These are NOT in the final product. They are just used to make decisions on outliers.
Inner fences are 1.5 IQR beyond the ends of the box (at Q1 - 1.5 IQR and Q3 + 1.5 IQR); outer fences are 3.0 IQR beyond the ends of the box (at Q1 - 3.0 IQR and Q3 + 3.0 IQR).
[Figure: inner fences, 1.5 IQR beyond each end of the box]
[Figure: outer fences, a further 1.5 IQR beyond the inner fences]

Step 3 for boxplot – Whiskers
Each whisker extends from the box to the data point closest to, but still inside, the inner fence. Remember, the whiskers end at a data point, not at the inner fences.
[Figure: whiskers]

Step 4 for boxplot – Mild outliers
Mild outliers for a boxplot are defined to be points located between the inner and outer fences. They are denoted by open circles.
[Figure: mild outliers]

Step 5 for boxplot – Extreme outliers
Extreme outliers for a boxplot are defined to be points located beyond the outer fences. They are denoted by filled circles.
[Figure: extreme outliers]

Final boxplot
Remember, the fences are not actually drawn.
You can see the four features of distributions easily with a boxplot. Outliers, for example, are explicitly drawn.

Using Boxplots
Central location is shown through the median (some boxplots will show the mean as a separate line).

Using Boxplots
Spread is shown through the IQR (you cannot get the standard deviation from a boxplot).
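The five construction steps can be sketched in Python as follows. This is a hedged illustration, not course code: `boxplot_parts` is a hypothetical helper, the twelve data values are made up, the quartiles follow the slides' halving rule, and a point lying exactly on a fence is assumed to count as inside it (the slides do not say how to treat that case).

```python
import statistics

def boxplot_parts(data):
    """Box, whisker ends, and outliers following the five boxplot steps."""
    values = sorted(data)
    half = len(values) // 2
    q1 = statistics.median(values[:half])    # median of the first half
    q3 = statistics.median(values[-half:])   # median of the second half
    iqr = q3 - q1

    # Step 2: fences (used for decisions only, never drawn).
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    # Step 3: whiskers end at the most extreme points inside the inner fences.
    inside = [x for x in values if inner_low <= x <= inner_high]
    # Step 4: mild outliers lie between the inner and outer fences.
    mild = [x for x in values
            if outer_low <= x < inner_low or inner_high < x <= outer_high]
    # Step 5: extreme outliers lie beyond the outer fences.
    extreme = [x for x in values if x < outer_low or x > outer_high]

    return {
        "box": (q1, statistics.median(values), q3),
        "whiskers": (min(inside), max(inside)),
        "mild": mild,
        "extreme": extreme,
    }

# Made-up data with one mild and one extreme high value.
parts = boxplot_parts([10, 12, 13, 14, 15, 16, 17, 18, 20, 24, 35, 50])
print(parts)
```

For these values the box runs from 13.5 to 22 with the median at 16.5, the whiskers end at the data points 10 and 24, 35 is a mild outlier, and 50 is an extreme outlier.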
You can also see the range of the data, but remember the range is often not that useful.

Using Boxplots
Shape can be seen through the box and the whiskers. If one side of the box and the corresponding whisker are longer, then the data are skewed in that direction (in the pictured boxplot, left skewed).

Using Boxplots
Sometimes the box “leans” one way and the whiskers the other. Then you can’t tell that much about shape from the boxplot. This happens most often in small datasets, where there isn’t much information about shape in the entire dataset anyway.
Remember that symmetric always gets the benefit of the doubt, so a slight “lean” isn’t enough to conclude skewness.
Outliers are of course drawn explicitly on the plot. You don’t have to take the boxplot’s definitions of “mild” and “extreme” as absolute truth, but they can be handy.

Some variants
Some people and/or computer programs add some “bells and whistles” to this basic boxplot. For example, Stata will often put a “+” in the boxplot showing the location of the mean.

Side by side boxplots
When comparing multiple groups of people (or anything else), boxplots provide a handy method for comparison.
By placing the boxplots side by side, you can immediately see similarities and differences in central location, spread, and shape.
[Figure: 1970 Draft Lottery, side-by-side boxplots with months on the x axis and draft number on the y axis]

Conclusions
There is clear evidence that the later months, especially December, fared far worse in the draft lottery than other months.
This draft was redone later after the unfairness was noted by many sources.

Review
There are four features of distributions: central location, spread, shape, and outliers.
Central location can be measured by the mode (nominal or ordinal data) or by the median or mean (interval/ratio data).
In interval/ratio data, spread can be measured by the range (rarely useful), the IQR, or the standard deviation.

More review
Outliers are any points far from the other points. This definition is deliberately vague.
Two people may disagree over whether a point is an outlier.
There is an explicit definition of outlier for a boxplot (any point more extreme than Q1 - 1.5 IQR or Q3 + 1.5 IQR), but that is NOT etched in stone.

More review
Shape is in “the tails”.
If the tails are of equal length, then the distribution is symmetric.
If the tail for lower values is longer, the distribution is left skewed.
If the tail for higher values is longer, the distribution is right skewed.

Describing a single distribution
When describing a distribution, or comparing two distributions, you need to mention all four features of the distributions, noting where they are similar and where they are different.
For example, “all the distributions have the same spread (IQR is around 5, standard deviation is around 7), but distribution A is, on average, much higher than distribution B (mean for A is 78 while the mean for B is 70). Both distributions are symmetric and have no outliers”.

Example
Two classrooms were observed, with one classroom (n=21) using “new directed reading activities” and another classroom (n=23) not using the activities.
This might be useful for an exploratory study, but it cannot provide conclusive evidence of anything, as the classrooms differ on far more than just “activities” or “no activities” (for example, the teachers differ).

Example continued
Descriptive statistics
Controls: n=23, mean=41.52, M=42, std.dev = 17.15, IQR=26
Treatment group: n=21, mean=51.48, M=53, std.dev = 11.00, IQR=14

An example paragraph summary
The two groups vary most on spread, both in terms of standard deviation (17.15 for the controls and 11.00 for the treatment group) and IQR (26 for the control group and 14 for the treatment group). The difference in spread is large enough that the control group’s scores extend beyond the treatment group’s at both the high and low ends.

Paragraph summary continued
On average, scores are higher in the treatment group.
The mean of the treatment group is 51.48 compared to a mean of 41.52 for the controls (the respective medians are 53 and 42).
Both groups appear approximately symmetric (perhaps a slight right skew for the control group) and have no outliers.
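The summary above reports both a mean and a median for each group. As a reminder of why the two can differ, the outlier example from earlier in the deck can be reproduced with Python's standard `statistics` module. This is only a sketch for checking the arithmetic; the variable names are illustrative, and it does not use the classroom data, which are not given in the slides.

```python
import statistics

original = [1, 4, 6, 10, 12]
with_outlier = [1, 4, 6, 10, 14000]   # the 12 replaced by 14000

mean_before = statistics.mean(original)           # 33/5 = 6.6
median_before = statistics.median(original)       # 6
mean_after = statistics.mean(with_outlier)        # 14021/5 = 2804.2
median_after = statistics.median(with_outlier)    # still 6

print(mean_before, median_before)
print(mean_after, median_after)

# With an even number of observations, the two middle values are averaged.
print(statistics.median([6, 10, 4, 3]))           # (4 + 6) / 2 = 5.0
```

The median stays at 6 while the mean jumps by a factor of several hundred, which is why the median and IQR are the safer summaries when outliers are present.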