HISTOGRAMS AND DISTRIBUTIONS

Suppose you want to know whether athletes have faster reflexes than non-athletes. To get as close to an answer as possible, you decide to run an experiment: using a web-based program, you measure the reaction times of 25 athletes and 25 non-athletes under controlled conditions.

Frequency refers to how often a particular value appears in the data:

Reaction Time   Frequency
230             1
231             0
232             0
233             2
234             0
235             0
236             0
237             0
Etc.

A histogram is a plot of frequency:

[Figure: Histogram, frequency vs. Time (ms), one bar per value from 200 to 340 ms]

This is a weak attempt at making an informative histogram. Why?

It would be more informative to place the data into intervals called bins. You choose the appropriate bin size.

[Figure: the data grouped into bins] The above bins have an interval of 10.

If the bin intervals are too small, the histogram will be too spread out:

[Figure: Histogram, frequency vs. Time (ms)] The bins above have an interval of 1.

If the bin intervals are too large, the information will be too clumped:

[Figure: Histogram, frequency vs. Time (ms), with two bins, 201-280 and 281-360] The bins above have an interval of 80.

Let's go back to a bin interval of 10 and look at the resulting histogram:

[Figure: Histogram, frequency vs. Time (ms), bin interval of 10] This is a decent choice.
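The effect of bin size can be tried out with a short Python sketch. It is only an illustration: the values are the 25 non-athlete reaction times listed later in this transcript, while the function name `bin_counts`, the starting edge of 200 ms, and the convention that each bin includes its lower edge but not its upper edge are assumptions made for this example.

```python
from collections import Counter

# Non-athlete reaction times (ms), taken from the table later in the transcript.
reaction_times_ms = [
    210, 225, 233, 233, 247, 256, 257, 268, 270, 274, 276, 278, 282,
    286, 287, 295, 298, 298, 300, 305, 307, 311, 314, 324, 329,
]


def bin_counts(values, bin_width, start):
    """Count how many values fall into each bin of equal width.

    Bins are keyed by their lower edge; a value equal to a bin's upper
    edge is counted in the next bin (an assumed convention).
    """
    counts = Counter()
    for value in values:
        lower_edge = start + ((value - start) // bin_width) * bin_width
        counts[lower_edge] += 1
    return counts


# A bin interval of 10 ms gives a readable text histogram.
counts_10 = bin_counts(reaction_times_ms, bin_width=10, start=200)
for lower_edge in range(200, 340, 10):
    n = counts_10[lower_edge]
    print(f"{lower_edge}-{lower_edge + 9} ms | {'#' * n} ({n})")

# A bin interval of 80 ms clumps everything into two bars.
counts_80 = bin_counts(reaction_times_ms, bin_width=80, start=200)
print(dict(counts_80))
```

Changing `bin_width` to 1 reproduces the opposite problem: almost every bar has a height of 0 or 1.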
Remember that all intervals must have the same size.

[Figure: Histogram, frequency vs. Time (ms)]

SAMPLE SIZE: Currently the sample size is only 25 students in the non-athlete group. Let's see what happens to our histogram as more data is collected (sample size increases).

SAMPLE SIZE: The sample size is now 73 students. Let's compare the before and after histograms.

bin           non-athletes
200-210       0
210-220       1
221-230       2
231-240       2
241-250       1
251-260       2
261-270       2
271-280       6
281-290       12
291-300       17
301-310       15
311-320       9
321-329       4
330-339       0
sample size   73

[Figure: Histogram (before) and Histogram (after), frequency vs. Time (ms)]

We can imagine that our intervals are infinitely small and our sample size is infinitely large, which will result in the formation of a smooth curve:

[Figure: Histogram with a smooth curve, frequency vs. Time (ms)]

This curve is known as a Normal Distribution or Bell-Shaped Curve. It represents the probability of getting a data point in a given range of data.

For example, the probability of your next measurement being between 261 and 341 is near 100%. Likewise, the probability of your next measurement being between 261 and 300 is around 50%, as this is half the area under the curve.

What is the probability of your next data measurement being 291.34544 ms?
Near ZERO, since this is only a tiny fraction of the area under the curve.

DESCRIPTIVE STATISTICS

Measures of Central Tendency

1. The MEAN: This should be something you can already calculate for a data set. Sum the numbers and divide by how many numbers you have. It can be expressed mathematically as x̄ = (x₁ + x₂ + … + xₙ) / n, where x is the variable that you are measuring and n is the number of measurements you have made.

2. The MEDIAN: This is simply the value in a data set that separates the higher half of a sample from the lower half. For example, in the sample below, the value that separates the higher and lower halves of the data is 291 ms, which is the median.

Reaction Time (ms): 265, 273, 286, 291, 293, 300, 330

Just arrange the data from highest to lowest or vice versa and find the central number.

What if there is an even number of data points, as shown below? Just average the two central measurements. In this case you average 286 and 292 to get a median of 289.

Reaction Time (ms): 265, 273, 286, 292, 293, 300

Careful with the MEAN and MEDIAN

For example, a college boasts that the average starting salary of last year's graduating class was $362,000 per year. This sounds quite impressive. However, what they did not tell you was that the class size was 30 students, of which 29 started at $30,000 a year and one student was a first-round draft pick in the NFL making approximately $10,000,000 per year.
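The mean and median calculations above can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the salary list assumes 29 graduates at $30,000 plus the one draft pick (30 students in total), which is the split that reproduces the quoted average of roughly $362,000.

```python
from statistics import mean, median

# The two small reaction-time samples (ms) from the median slides.
odd_sample = [265, 273, 286, 291, 293, 300, 330]
even_sample = [265, 273, 286, 292, 293, 300]

# Odd number of points: the median is the single central value.
median_odd = median(odd_sample)
# Even number of points: the median is the average of the two central values.
median_even = median(even_sample)
print("median (7 points):", median_odd)
print("median (6 points):", median_even)

# Salary example. Assumption: 29 graduates at $30,000 and one at $10,000,000.
salaries = [30_000] * 29 + [10_000_000]
mean_salary = mean(salaries)      # pulled far upward by the single outlier
median_salary = median(salaries)  # unaffected by the single outlier
print("mean salary:  ", round(mean_salary))
print("median salary:", median_salary)
```

The single $10,000,000 value moves the mean by more than a factor of ten while leaving the median where most of the class actually is.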
Such a data point ($10,000,000 per year) can be considered an outlier, which is a data point much higher or lower than the rest of the data points. An outlier can also be seen in the histogram of our athlete data; perhaps the person blinked while the reaction time was being measured.

[Figure: Histogram of the athlete data showing an outlier, frequency vs. Time (ms)]

What is the median of the salary data set? $30,000. The median is far less sensitive to outliers than the mean.

Careful with the MEAN and MEDIAN

So should we be focusing on the median more than the mean? No. Generally speaking, the mean is typically a more accurate measure of central tendency than the median once outliers have been dealt with. To convince yourself, try this exercise from Seeing Statistics (www.seeingstatistics.com):

The median is more resistant to extreme, misleading data values so it would seem to be the clear choice. However, we also need to consider accuracy. Is the median or the mean more likely to be close to the true value? To evaluate the relative accuracy of the median and the mean, let's consider how they do when we know the true center of the data. Suppose that the only possible scores are the whole numbers between 0 and 100.
0, 1, 2, 3, and so on up to 100.

The center of these 101 numbers, whether we use the median or the mean, is 50. What if we were to select five numbers randomly from this set of 101 and calculate the median and mean of those five numbers? Would the median or the mean be closer to what we know is the true value of 50?

Measures of Spread

1. The RANGE: This is simply the length of the smallest interval containing all of the data. For example, the range of the data below would be 265 ms to 300 ms.

Reaction Time (ms): 265, 273, 286, 292, 293, 300

However, the range suffers from the same drawback as the mean when describing data, and even more so: once again, outliers.

Calculate the range now with the addition of one new measurement that happens to be an outlier: 265 ms to 734 ms.

Reaction Time (ms): 265, 273, 286, 292, 293, 300, 734

The range is more sensitive to outliers than the mean because, with a large sample size, the effect of an outlier on the mean is diluted.

2. The INTERQUARTILE RANGE: The interquartile (between quarters) range is one way around the outlier issue. This value is calculated by first splitting the data up into four sections (quarters) from low to high, with the same number of data points in each section.

[Figure: sorted data split into Quarter 1 through Quarter 4]

The interquartile range is the range between the number that defines the upper end of Quarter 1 (Q1) and the number that defines the upper end of Quarter 3 (Q3). Let's look at an example.
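As a small illustration ahead of the worked example, the sketch below compares the range and the interquartile range on the six-point sample above, with and without the 734 ms outlier. The quartile rule used here (take the median of the lower half and of the upper half, leaving the overall median out of both halves when the number of points is odd) is one common convention and is an assumption of this sketch; software packages use several slightly different rules. The function names are hypothetical.

```python
from statistics import median


def value_range(data):
    """Return the smallest and largest value, i.e. the ends of the range."""
    return min(data), max(data)


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd number of points the overall median belongs to neither
    half. Assumes at least two data points.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(upper_half)


sample = [265, 273, 286, 292, 293, 300]
with_outlier = sample + [734]

for label, data in [("without outlier", sample), ("with outlier", with_outlier)]:
    low, high = value_range(data)
    q1, q3 = quartiles(data)
    print(f"{label}: range {low} to {high} ms ({high - low} ms wide), "
          f"interquartile range {q1} to {q3} ms ({q3 - q1} ms wide)")
```

Adding the outlier stretches the range from 35 ms to 469 ms, while the interquartile range only grows from 20 ms to 27 ms.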
2. The INTERQUARTILE RANGE: Calculate the interquartile range of this data:

A. Find the median: 268 ms, the 13th value.
B. Now find the median of the first half of the data, excluding the 13th value: (231 + 231) / 2 = 231 ms = Q1.
C. Find the median of the second half of the data, excluding the 13th value: (290 + 294) / 2 = 292 ms = Q3.
D. The interquartile range is 231 ms to 292 ms. It is also sometimes stated as Q3 − Q1, which would be 61 ms in this case.

If you start with an even number of data points, split the data in half and find the median of each half. In this case one would split the data between values 12 and 13.

A. The median of the top half of the list (the lower values) is 231 ms again.
B. The median of the bottom half of the list (the higher values) is (287 + 290) / 2 = 288.5 ms, rounded to 289 ms.
C. The interquartile range is 231 ms to 289 ms. It is also sometimes stated as Q3 − Q1, which would be 58 ms in this case.

3. The STANDARD DEVIATION (σ or s)

The standard deviation is simply a value describing the distance from the mean in BOTH directions that will encompass 68% of your data on average (for data that follow a normal distribution). Therefore, σ is a direct measure of the spread of your data. Let's look at a quick example.

[Figure: histogram of blood pressure data for a large sampling of adult males]

The mean is around 82 mmHg and σ is around 10 mmHg. What does this mean? It means that 68% of the data points fall within 82 +/- 10 mmHg (between 72 and 92 mmHg).

Therefore, the more spread out your data is, the greater the value of σ.
To take it a step further, two standard deviations away from the mean on both sides (+/- 2σ) will encompass 95% of the data. Likewise, +/- 3σ will encompass 99.7% of the data.

How does one calculate the standard deviation (σ)? Let's go back to our athlete/non-athlete reaction time data to see how this is done, starting with the non-athlete sample.

Where should we begin? By calculating the mean (X̄): 278.5 ms.

Now what? (Think about what σ tells us.) It describes the spread of the data (or the width of the normal distribution / bell-shaped curve). Therefore, it is only logical to find how far away all of your data is from the mean.

Non-Athletes (X̄ = 278.5 ms)

Individual   X, Reaction Time (ms)   X − X̄
1            210                     -68.52
2            225                     -53.52
3            233                     -45.5
4            233                     -45.5
5            247                     -31.5
6            256                     -22.5
7            257                     -21.5
8            268                     -10.5
9            270                     -8.5
10           274                     -4.5
11           276                     -2.5
12           278                     -0.5
13           282                     3.5
14           286                     7.5
15           287                     8.5
16           295                     16.5
17           298                     19.5
18           298                     19.5
19           300                     21.5
20           305                     26.5
21           307                     28.5
22           311                     32.5
23           314                     35.5
24           324                     45.5
25           329                     50.5

X − X̄ = the measured value minus the mean (210 − 278.5, 225 − 278.5, 233 − 278.5, 233 − 278.5, and so on). The first two rows of the table use the unrounded mean, 278.52; the remaining rows use 278.5.

Now we are starting to get an idea about how spread out the data is from the mean, which is what σ is all about.
The next step is to square all of the differences, (X − X̄)²:

Non-Athletes

Individual   X, Reaction Time (ms)   X − X̄     (X − X̄)²
1            210                     -68.52    4694.9904
2            225                     -53.52    2864.3904
3            233                     -45.5     2070.25
4            233                     -45.5     2070.25
5            247                     -31.5     992.25
6            256                     -22.5     506.25
7            257                     -21.5     462.25
8            268                     -10.5     110.25
9            270                     -8.5      72.25
10           274                     -4.5      20.25
11           276                     -2.5      6.25
12           278                     -0.5      0.25
13           282                     3.5       12.25
14           286                     7.5       56.25
15           287                     8.5       72.25
16           295                     16.5      272.25
17           298                     19.5      380.25
18           298                     19.5      380.25
19           300                     21.5      462.25
20           305                     26.5      702.25
21           307                     28.5      812.25
22           311                     32.5      1056.25
23           314                     35.5      1260.25
24           324                     45.5      2070.25
25           329                     50.5      2550.25

Then you, for the most part, average the squares: Σ(X − X̄)² / (n − 1).

The reason one uses n − 1 is to account for sample size. If n is large, you are essentially dividing by n and averaging. If n is small, like a sample size of n = 3, then using n − 1 makes a large difference in the resulting estimate of σ.
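The steps so far (mean, differences from the mean, squared differences, division by n − 1) can be sketched in Python as follows. This is an illustration rather than the slides' own worksheet: it keeps the mean unrounded for every row, so intermediate values differ slightly from a hand calculation that rounds the mean to 278.5, and the variable names are chosen for this example. The last line takes the square root of the result, which converts the "average" of the squares back into milliseconds.

```python
import math
import statistics

# Non-athlete reaction times (ms) from the table above.
reaction_times_ms = [
    210, 225, 233, 233, 247, 256, 257, 268, 270, 274, 276, 278, 282,
    286, 287, 295, 298, 298, 300, 305, 307, 311, 314, 324, 329,
]

n = len(reaction_times_ms)

# Step 1: the mean.
mean_ms = sum(reaction_times_ms) / n

# Step 2: how far each measurement is from the mean (X minus the mean).
deviations = [x - mean_ms for x in reaction_times_ms]

# Step 3: square the differences so negative and positive values do not cancel.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: "average" the squares, dividing by n - 1 rather than n.
sample_variance = sum(squared_deviations) / (n - 1)

# Step 5: the square root brings the units back to milliseconds.
sample_std_dev = math.sqrt(sample_variance)

print(f"mean = {mean_ms:.2f} ms")
print(f"variance = {sample_variance:.1f}")
print(f"standard deviation = {sample_std_dev:.1f} ms")

# Cross-check against the standard library's n - 1 implementation.
assert math.isclose(sample_variance, statistics.variance(reaction_times_ms))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same n − 1 divisor, so they can replace steps 2 to 5 once the logic is understood.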
Then you essentially average the squares: Σ(X − X̄)² / (n − 1) = 998.2

This number is known as the variance and is directly related to the spread of your data.

One more step to get σ. Take the square root of the "average" to go back to the original units: √( Σ(X − X̄)² / (n − 1) ) = 31.6

This is the standard deviation (σ). What does this number mean?
It means that ACCORDING TO THE CURRENT DATA, 68% of future data collected should fall within 279 +/- 31.6 ms. Read the capitalized text above (red on the slide) over and over, as your stats are only as good as your data. Use common sense.

Standard deviation formula (what we just did): s = √( Σ(X − X̄)² / (n − 1) )

Your turn, athlete data: 264.4 +/- 30.6 ms

Summary of current data:

               Mean +/- σ
Athletes       264 +/- 30.6
Non-athletes   279 +/- 31.6

What does it mean? Patience.

The significance of the standard deviation: The graph on the right shows two data sets having the SAME mean. What is different then? The blue data set has a greater spread and therefore a larger σ. Which data set would you prefer (if you had a choice)? The red one, as there is less noise / variability. Variability is an inevitable limitation in the methods we use to observe nature.
It is your job to make as precise a measurement as possible, thereby limiting the variability.

Compare the histograms of non-athletes to athletes:

[Figure: two histograms, Non-athletes and Athletes, frequency vs. reaction time (ms), bins 200-210 through 330-339]

Better yet, overlay the histograms:

[Figure: overlaid histograms of Non-athletes and Athletes, Number of students (frequency) vs. Reaction time (ms)]

               Mean +/- σ
Athletes       264 +/- 30.6
Non-athletes   279 +/- 31.6

Q: Is there really a difference between these two groups? What should we do?

Collect more data (larger sample size), which is really the only option at this point.

bin           non-athletes   athletes
200-210       0              3
210-220       1              6
221-230       2              8
231-240       2              12
241-250       1              15
251-260       2              10
261-270       2              8
271-280       6              6
281-290       12             3
291-300       17             3
301-310       15             2
311-320       9              1
321-329       4              0
330-339       0              0
sample size   73             77

[Figure: histogram of the larger samples, Non-athletes and Athletes]
[Figure: overlaid histograms of Non-athletes and Athletes for the small and the large samples, Number of students (frequency) vs. Reaction time (ms), bins 200-210 through 330-339]

Sample size: 25 in each group (N = 50)

               Mean +/- σ
Athletes       264 +/- 30.6
Non-athletes   279 +/- 31.6

Sample size: 73 in non-athletes, 77 in athletes

               Mean +/- σ
Athletes       251 +/- 30.8
Non-athletes   298 +/- 28.5

What you should notice is that the means changed dramatically and the two groups are beginning to separate, indicating that there may actually be a difference. There is no substitute for carefully collected, high-quality data and a large sample size.

Let's go back to the small sample size data (athletes 264 +/- 30.6, non-athletes 279 +/- 31.6).

[Figure: overlaid histograms for the small samples, Number of students (frequency) vs. Reaction time (ms)]

How can we determine if there is a significant difference between these two groups?
T-Test: assesses whether the means of two groups are statistically different from each other.

[Slides showing the t-value formula; the one surviving label reads "= Standard Error of the difference"]

Therefore the t-value is related to how different the means are and how broad your data is. A large t-value (in absolute terms) is what you hope for.

Calculate the t-score: t = -1.61

Degrees of freedom is the number of people in both groups combined, minus 2: df = 48

The null hypothesis vs the hypothesis

1. The hypothesis: Athletes will have a quicker reaction time than non-athletes.
2. The null hypothesis: The null hypothesis always states that there is no relationship between the two groups; here, that there is no difference in reaction time between athletes and non-athletes.

The p-value

1. The p-value is a number between 0 and 1.
2. It is the probability (hence the p-value) of seeing a difference at least as large as the one in your data if the null hypothesis were true, that is, if there were really no difference between the groups.
3. It is therefore not the probability that the null hypothesis is true, and 1 minus the p-value is not the probability that there is a difference between the two groups.
4. In order for the data to support the hypothesis, must the p-value be high or low? The p-value should be low (< 0.05), which says that a difference this large would turn up less than 5% of the time if there were no real difference between the two groups.

Statistical Significance

When the p-value is less than 0.05, we say that the data is statistically significant, and there may be a real difference between the two groups. Be warned that just because p is less than 0.05 between two groups doesn't mean that there is actually a difference.
For example, if we find p < 0.05 for the reaction time experiment, it doesn't mean that there is a definite difference between athletes and non-athletes. It only means that there is a difference in our data, but our data might be flawed, or there is not enough data yet (sample size too small), or we measured the data improperly, or the sampling wasn't random, or the experiment was garbage, etc. Doubt is the greatest tool of any scientist (person).

How is the p-value determined? The p-value is found by using a standard t-table in combination with the t-value and the degrees of freedom previously determined:

http://bioinfo-out.curie.fr/ittaca/documentation/Images/ttable.gif
http://davidmlane.com/hyperstat/t-table.html
http://www.graphpad.com/quickcalcs/Pvalue2.cfm

Now you determine the p-value for your data.

1. Begin by choosing the variable to compare groups on, like grade for example. Since the t-test can only compare two groups at a time and there are four grades, we need to perform all the possible pairings (there was apparently only one 9th grader, so the sample size is too low to look at this grade):

10th vs 11th
10th vs 12th
11th vs 12th

We would also want to know if the mean of each group is significantly different from the actual value:

Actual value vs 10th
Actual value vs 11th
Actual value vs 12th

This needs to be done twice, once for the line estimation and once for the dots estimation!

These are the tables you need to fill out:

Grade   Mean   SD   Variance
10th
11th
12th

Grades           Difference of means   Variability of Groups   T-score   P-value
10th vs actual
11th vs actual
12th vs actual
10th vs 11th
10th vs 12th
11th vs 12th

Write a conclusion based on your analysis. Remember, just because p < 0.05, it doesn't necessarily mean your hypothesis is supported!
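For the "Difference of means", "Variability of Groups" and "T-score" columns, a calculation along the following lines may help. It is a sketch under a stated assumption: the formula slides did not survive in this transcript, so the standard error of the difference is computed here in the usual textbook way, as the square root of each group's variance divided by its sample size, summed over the two groups. The function names are hypothetical. The example call uses the reaction-time summary values quoted earlier (athletes 264.4 +/- 30.6, non-athletes 278.5 +/- 31.6, 25 in each group).

```python
import math


def standard_error_of_difference(sd_1, n_1, sd_2, n_2):
    """Variability of the groups: sqrt(s1^2 / n1 + s2^2 / n2) (assumed formula)."""
    return math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)


def t_score(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Difference of means divided by the standard error of the difference."""
    difference = mean_1 - mean_2
    return difference / standard_error_of_difference(sd_1, n_1, sd_2, n_2)


def degrees_of_freedom(n_1, n_2):
    """Number of people in both groups combined, minus 2."""
    return n_1 + n_2 - 2


# Athletes vs non-athletes, small samples (values rounded as on the slides).
t = t_score(264.4, 30.6, 25, 278.5, 31.6, 25)
df = degrees_of_freedom(25, 25)
print(f"t = {t:.2f}, df = {df}")
```

With these rounded inputs the result is about -1.60, close to the -1.61 quoted earlier; the small gap comes from rounding of the means and standard deviations. The p-value still has to be looked up from the t-value and the degrees of freedom using one of the t-tables linked above.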