NOTICE
Please note that these slides are not a substitute for reading and interacting with your course material. Their intended purpose is to be a quick reference for studying concepts, and to present the material from a different angle that might help you better understand the statistics.

For the Student
How do insurance companies determine premium rates for different age groups and sexes? What information does the government use to decide who gets taxed how much? How are you going to determine which vehicle is the safest to drive? The answer to each of these questions relies heavily on statistics. Unfortunately, a large amount of the statistics in society has been manipulated very badly and will give you unreliable results. Statistics is everywhere and affects virtually everyone. The question to ask yourself now is: “Are you going to become another ‘victim of statistics’?” Whether your future career is in education, politics, or fire fighting, making decisions based on statistics will be inevitable, and there will be consequences. Statistics is not hard; it only takes a little time and patience to gain a true understanding of what your information really means.

To the Student
A little hint to keep yourself from being overwhelmed while first learning statistics: it is not vital to memorize all of the equations. Although memorizing them can help, it is better to understand what the equations mean, why they are used, and when to use them. Some of the equations, especially any equation that has a Σ (summation) symbol, require adding a series of many numbers. Practically speaking, you should use a calculator or computer to compute such equations. Focus on understanding what the computer is doing, so that the number that pops out is more than a number to you, because that number means something. Don’t memorize, just recognize.

Graphs and Summaries
A. One Categorical Variable
   1. Graphs
   2. Summaries
B. One Numerical Variable
   1. Graphs
      Applets
      a. Stemplot
      b. Histogram
      c. Boxplot
      d. Normal Quantile plot (Q-plot)
      e. Shapes
         i. Symmetric
            1) Normal
            2) Uniform
         ii. Skewed
            1) Left
            2) Right
   2. Summaries
      a. Locations
         i. Mean
         ii. Median
         iii. Mode
         iv. Min and Max
         v. Quartiles
         vi. Comparisons of Mean and Median
         vii. Z-scores, Empirical Rule
      b. Spreads
         i. Variance
         ii. Standard Deviation
         iii. Range
         iv. Interquartile Range (IQR)
   3. Transformations
      a. Shift changes
         i. Centers
         ii. Spreads
      b. Scale changes
         i. Centers
         ii. Spreads

Beginning Definitions
Variable: the overall characteristic of interest that we want to understand.
   e.g. Percent of people who use deodorant in America
   e.g. Average debt of a college graduate from Texas A&M
Individual: a single subject or observation on which the variable takes a value.
   e.g. Bob, an American who does not use deodorant
   e.g. Jill, a college graduate of Texas A&M with $10,000 in debt

Variable Types
• Categorical (Qualitative)
   - Nominal, e.g. Colors {red, blue, green}
   - Ordinal, e.g. Strength {weak, moderate, strong}
• Numerical (Quantitative)
   - Discrete, e.g. Number of children in a family
   - Continuous, e.g. Amount of water the average house uses
{Depending on the context, certain discrete numbers can be considered continuous for practical purposes, and continuous data can be made discrete.}

Distribution
- Shape, e.g. unimodal, bimodal, multimodal, symmetric, skewed right, skewed left
- Center, e.g. mean, median
- Spread, e.g.
range, standard deviation, variance, interquartile range

Categorical Graphs (Nominal or Ordinal)
• Pie Charts
• Bar Graphs

Pie Charts (Counts and Percents)
[Figure: two pie charts of Country of Origin, one labeled with counts and one with percents. American n=253 (62.47%), European n=73 (18.02%), Japanese n=79 (19.51%).]

Bar Graphs
[Figure: two bar graphs of Country of Origin for the same three groups, one with counts (y-axis Count) and one with percents (y-axis Percent).]

Numerical Graphs (Univariate)
• Stemplots
• Histograms
• Boxplots
• Normal Quantile Plots (Q-plots)

Stemplots

Stemplot of Scores
3 | 178
4 | 567
5 | 09
6 | 3789
7 | 013355789
8 | 00124588
9 | 7

Back to back stemplot
 boys |   | girls
   18 | 3 | 7
   67 | 4 | 5
    0 | 5 | 9
    7 | 6 | 389
13379 | 7 | 0558
 1488 | 8 | 0025
      | 9 | 7

Histogram
[Figure: two histograms of Total Calories per bar of Common Candy, x-axis Calories (100 to 400); one with y-axis Frequency and one with y-axis Density.]
Note that these are analogous to counts and percents with bar charts.

Boxplots
[Figure: boxplot of Calories in Common Candies, axis from 100 to 350.]
Boxplots are drawn from the five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum), which defines the box and whiskers, unless outliers are present. If there are low outliers, the lower whisker ends at the smallest value that is not an outlier; likewise, if there are high outliers, the upper whisker ends at the largest value that is not an outlier.
Outlier? A number is considered an outlier if it lies more than 1.5 times the IQR (interquartile range) below the 1st quartile or above the 3rd quartile.

Normal Plots (aka Q-plots)
[Figure: normal Q-plot of Calories in Common Candies; x-axis Theoretical Quantiles (-2 to 2), y-axis Sample Quantiles (100 to 350).]
Q-plots are used to judge how reasonable it is to assume that the sample comes from a normal distribution. If the sample comes from a normal distribution, the points should fall along a straight line; when the Q-plot includes a Q-line, the points should follow “closely” to that line. Unfortunately, there is no clear rule for declaring a set of data normal or not. It takes practice examining patterns in Q-plots to recognize “close calls”, but if the data are strongly skewed, the departure from the line will be very easy to see.

Shapes-Symmetric-Normal
The blue histograms are samples from a population of test grades that has an average of 65 and a standard deviation of 10. Notice that the one with more samples begins to look more like the density curve of a normal distribution (the red line).
[Figure: density histograms of 100 samples and 1000 samples from Normal(65, 10); x-axis test grade, y-axis Density.]
[Figure: boxplots and normal Q-plots (Theoretical Quantiles vs Sample Quantiles) of the 100 and 1000 samples from N(65, 10).]

Shapes-Symmetric-Uniform
[Figure: density histograms, boxplots and normal Q-plots of 100 samples and 1000 samples from Uniform(40, 95); x-axis test grade.]
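To make the idea of a Q-plot concrete, the sketch below computes the points that such a plot would display: each ordered sample value is paired with the standard normal quantile at the same relative position. The plotting position (i - 0.5) / n is one common convention and is an assumption here, as is the small list of test grades, which is hypothetical and not taken from the slides.

```python
from statistics import NormalDist


def qplot_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        # Plotting position (i - 0.5) / n: an assumed, commonly used convention.
        p = (i - 0.5) / n
        points.append((std_normal.inv_cdf(p), value))
    return points


# Hypothetical test grades, used only for illustration.
grades = [52, 58, 61, 63, 65, 66, 68, 71, 74, 80]

for theoretical, observed in qplot_points(grades):
    print(f"{theoretical:6.2f}  {observed}")
```

If the printed pairs were plotted, roughly normal data would trace a nearly straight line, while skewed data would bend away from it at one end.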
Shapes-Skewed-Right and Left
[Figure: a right-skewed and a left-skewed density curve.]
The other major pattern to recognize is skew. Think about a skewer on a barbecue grill: everything seems bunched toward one side of the stick. The pattern in graphs is similar. If the majority of the data lies on the left, with a long tail reaching to the right, then the graph is right skewed, and vice versa.

Shapes-Skewed Left
[Figure: density histograms, boxplots and normal Q-plots of 100 samples and 1000 samples that are skewed left; x-axis Average Speed of Stock Cars.]

Shapes-Skewed Right
[Figure: density histograms, boxplots and normal Q-plots of 100 samples and 1000 samples that are skewed right; x-axis Costs of Meals at Restaurants.]

Summaries Locations - Mean
Heights of students: 71 70 68 69 68 65 72 69 71 62
$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + \cdots + x_n}{n} = \frac{71 + 70 + 68 + 69 + 68 + 65 + 72 + 69 + 71 + 62}{10} = 68.5$$

Summaries Location-Median
Heights of students: 71 70 68 69 68 65 72 69 71 62
Ordered heights: 62 65 68 68 69 69 70 71 71 72
Median = $\tilde{x} = 69$
If the number of observations is even, the median is the average of the middle two numbers of the ordered data. If the number of observations is odd, the middle number of the ordered data is the median.
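The mean and median rules above can be checked with a few lines of Python. This is only a sketch: the function name `median_by_hand` is made up for illustration, and the data are the ten student heights from the slide.

```python
from statistics import median

# Heights of students, as listed on the slide.
heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]


def median_by_hand(values):
    """Median following the even/odd rule described on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the middle number of the ordered data.
        return ordered[middle]
    # Even count: the average of the two middle numbers.
    return (ordered[middle - 1] + ordered[middle]) / 2


mean_height = sum(heights) / len(heights)
print(mean_height)              # 68.5
print(median_by_hand(heights))  # 69.0
print(median(heights))          # the standard library applies the same rule
```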
Heights of male students: 65 68 70 71 72
$\tilde{x} = 70$

Summaries Location-Mode, Min, Max
Ordered heights: 62 65 68 68 69 69 70 71 71 72
Mode = 68, 69 and 71
The mode is the most common number. If several numbers tie for most common, then there is more than one mode.
Min = 62
Max = 72

Summaries Location-Quartiles
Ordered heights: 62 65 68 68 69 69 70 71 71 72
1st Quartile = 68
3rd Quartile = 71
To find the 1st and 3rd quartiles, consider the data to the left and to the right of the median separately. The median is the 2nd quartile. The 1st quartile is the middle number (or the average of the two middle numbers if the subset has an even count) between the minimum and the median. The 3rd quartile is calculated the same way, using the maximum in place of the minimum.
Technical note: include the median when finding the 1st or 3rd quartile if the number of observations is odd.

Comparing Means and Medians
Notice the blue and red lines on the distribution graphs below. The blue line represents the mean and the red line represents the median. This demonstrates that when data become skewed, the mean is affected more than the median. The bottom graph shows that the mean and median are about the same for a normal distribution.
[Figure: right-skewed and left-skewed distributions with the mean and median marked, and a normal distribution where the median and mean are the same.]

Z - Scores
Suppose we are given a set of data that has a normal distribution. Given that we already know the mean and the standard deviation, we want to find precisely how many standard deviations a certain value lies from the mean. That number is called a z-score. The equation is:
$$z = \frac{x - \mu}{\sigma}$$
Why is the z-score useful to us? If we compare our z-score to the 68-95-99.7 rule, we can learn roughly what percentage of values is greater than or less than our value. Suppose we had a z-score of 1.5. Our value lies between 1 and 2 standard deviations from the mean, so more than 68% (and less than 95%) of the values are closer to the mean than ours, meaning that we would have less than a 32% chance, and more than a 5% chance, of choosing a value at least this far from the mean at random.
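As a rough check on the z-score reasoning for z = 1.5, the sketch below computes a z-score and then the exact normal probabilities that the 68-95-99.7 rule only brackets. The grade of 80 is hypothetical; the mean of 65 and standard deviation of 10 are borrowed from the test-grade example on the normal-shape slide, and the data are assumed to be normal.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical grade of 80 from a Normal(65, 10) population of test grades.
z = z_score(80, mu=65, sigma=10)

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
below = std_normal.cdf(z)                            # share of values below ours
at_least_this_far = 2 * (1 - std_normal.cdf(abs(z)))  # both tails combined

print(z)                            # 1.5
print(round(below, 3))              # about 0.933
print(round(at_least_this_far, 3))  # about 0.134, between 5% and 32%
```

The exact two-tailed chance falls inside the range that the empirical rule gives, which is what a standard normal table would also report.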
Now consider a value with a z-score of -2.5, meaning that it is 2 and 1/2 standard deviations to the left of the mean. This score lies between the 95% and 99.7% bands, which means that we had less than a 5% chance, and more than a 0.3% chance, of selecting a value at least this far from the mean at random. We can look up our z-score in a table of “Standard Normal Probabilities” to find our exact chances of being so lucky.

Z-Scores
Based on the standard deviation, z-scores are used to determine how far away an observation is from the mean. A z-score of 1 corresponds to one standard deviation above the mean. The 68-95-99.7 rule is helpful in determining what the value of a z-score really means. Figure 2 is a density curve demonstrating what is meant by the 68-95-99.7 rule. The area under the blue region contains 68% of the data; the blue ends at z = 1 and z = -1. The red plus the blue contains 95% of the data, with the outer edges at z = 2 and z = -2. Likewise, adding the green gives 99.7% of the data. If we had a z-score of 0.5, we know that our number is somewhere in the blue. A z-score of 2.5 would lie somewhere in the green.
[Figure 2: normal density curve with z-scores on the horizontal axis. Blue: 68%. Blue & Red: 95%. Blue, Red & Green: 99.7%.]

When to use 68-95-99.7 rule

NORMAL
Value    Frequency   Percent   Valid Percent   Cumulative Percent
-2.00        1          2.0         2.0               2.0
  .00        2          4.0         4.0               6.0
 4.00        1          2.0         2.0               8.0
 5.00        4          8.0         8.0              16.0
 6.00        6         12.0        12.0              28.0
 7.00        4          8.0         8.0              36.0
 8.00        4          8.0         8.0              44.0
 9.00        6         12.0        12.0              56.0
10.00        3          6.0         6.0              62.0
11.00        5         10.0        10.0              72.0
12.00        2          4.0         4.0              76.0
13.00        2          4.0         4.0              80.0
14.00        2          4.0         4.0              84.0
15.00        1          2.0         2.0              86.0
16.00        1          2.0         2.0              88.0
17.00        4          8.0         8.0              96.0
18.00        1          2.0         2.0              98.0
19.00        1          2.0         2.0             100.0
Total       50        100.0       100.0

Statistics (NORMAL)
N Valid            50
N Missing           0
Mean                9.4200
Std. Deviation      4.6995
Percentile 25       6.0000
Percentile 50       9.0000
Percentile 75      12.2500

When do we use the Empirical Rule?
It is better to make a decision based on graphs (histograms, boxplots, Q-plots), but if all we are given is the table above, we can still notice some features of the distribution by looking at the frequency column. The tallies need to be fairly low at the top and bottom of this column, with the data building up near the middle. Notice that this is what we have in this example. If this pattern is apparent, the next step is to compare the standard deviation of the data with the percentiles. If the data are normal, then the interval within one standard deviation of the mean should contain about 68% of the data. According to the table, the middle 68% of the data (from the 16th to the 84th cumulative percent) lies between 5 and 14, a length of 9. The standard deviation is approximately 4.7, so the empirical rule says we expect this length to be about 2 × 4.7 = 9.4. Since 9 is close to 9.4, it is reasonable to conclude that the data have a normal distribution.

Empirical Rule usage

UNIFORM
Value    Frequency   Percent   Valid Percent   Cumulative Percent
-4.00        1          2.0         2.0               2.0
-2.00        3          6.0         6.0               8.0
-1.00        3          6.0         6.0              14.0
  .00        2          4.0         4.0              18.0
 1.00        1          2.0         2.0              20.0
 2.00        1          2.0         2.0              22.0
 3.00        1          2.0         2.0              24.0
 4.00        1          2.0         2.0              26.0
 5.00        3          6.0         6.0              32.0
 6.00        4          8.0         8.0              40.0
 7.00        1          2.0         2.0              42.0
 8.00        1          2.0         2.0              44.0
10.00        3          6.0         6.0              50.0
11.00        1          2.0         2.0              52.0
12.00        1          2.0         2.0              54.0
13.00        2          4.0         4.0              58.0
14.00        3          6.0         6.0              64.0
15.00        1          2.0         2.0              66.0
16.00        5         10.0        10.0              76.0
18.00        1          2.0         2.0              78.0
19.00        1          2.0         2.0              80.0
20.00        4          8.0         8.0              88.0
21.00        1          2.0         2.0              90.0
22.00        2          4.0         4.0              94.0
24.00        2          4.0         4.0              98.0
25.00        1          2.0         2.0             100.0
Total       50        100.0       100.0

Statistics (UNIFORM)
N Valid            50
N Missing           0
Mean               10.4400
Std. Deviation      8.3426
Percentile 25       3.7500
Percentile 50      10.5000
Percentile 75      16.5000

Once again, the two things we need to check are:
- the pattern of the tallies
- the 68% interval

Here the tallies at the top and bottom of the frequency column are about as large as, or larger than, the tallies in the center. But to be safe, we compare the 68% interval with the standard deviation.
The lower bound of the interval is between -1 and 0, and the upper bound is between 19 and 20. Therefore the length of the interval is between 19 and 21. With the empirical rule we would expect this interval to be around 2 × 8.34 = 16.68. Because 16.68 is clearly smaller than either of these lengths, we conclude that the data are not normal.

Spreads-Variance
Variance is a number that describes how much the data “varies”. There are two formulas below. The first is used when the data are the whole population, so that the population mean μ is known. The second is used for a sample; it divides by n - 1 because deviations measured from the sample mean tend to understate the variance of the population that the sample comes from. However, as n gets large there is very little difference between the two equations.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} = \frac{(x_1 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}$$
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Spreads-Standard Deviation
The standard deviation is just the square root of the variance. A value that lies one standard deviation above the mean has a z-score of exactly 1. Once again, the difference between the two formulas below is whether the data are the population or a sample from a population.
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} \qquad s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Spread-Calculation of variance and standard deviation
Heights of male students: 65 68 70 71 72
$$\bar{x} = 69.2$$
$$s^2 = \frac{(65 - 69.2)^2 + (68 - 69.2)^2 + (70 - 69.2)^2 + (71 - 69.2)^2 + (72 - 69.2)^2}{5 - 1} = 7.7$$
$$s = \sqrt{7.7} = 2.77$$

Summaries-Range and IQR
Ordered heights: 62 65 68 68 69 69 70 71 71 72
1st Quartile = 68
3rd Quartile = 71
Range = Max - Min = 72 - 62 = 10
Interquartile Range (IQR) = 3rd Quartile - 1st Quartile = 71 - 68 = 3

Transformations
A transformation is when each value of a data set is put through the same function. For example, if we add a number n to every observation, we will have a transformed data set that is shifted n units. If we multiply or divide every observation by the same number, then the data set will have a new scale.
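A transformation in this sense can be sketched as one function applied to every observation. The helper name `transform`, its parameters, and the four data values below are all hypothetical, chosen only to show a shift and a scale change side by side.

```python
def transform(data, shift=0.0, scale=1.0):
    """Apply the same linear function, shift + scale * x, to every observation."""
    return [shift + scale * x for x in data]


# Hypothetical observations, not taken from the slides.
observations = [2, 4, 4, 7]

shifted = transform(observations, shift=3)    # every value moves up by 3 units
rescaled = transform(observations, scale=10)  # every value is multiplied by 10

print(shifted)   # [5.0, 7.0, 7.0, 10.0]
print(rescaled)  # [20.0, 40.0, 40.0, 70.0]
```

Note that the gaps between neighboring values are unchanged by the shift but are stretched by the scale change.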
If you are given a mean, $\bar{x}$ (or $\mu$), and a standard deviation, $s$ (or $\sigma$), and want to convert your data so that you have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and a new standard deviation, $s_{new}$ (or $\sigma_{new}$), all you need to remember is what shift and scale changes affect. In our linear transformation formula, a is the shift and b is the scale:
$$x_{new} = a + b x$$

Transformation
Standard deviations are only affected by scale changes, but means are affected by both shift and scale changes. This means that:
$$\bar{x}_{new} = \text{shift} + \text{scale} \cdot \bar{x} \qquad s_{new} = \text{scale} \cdot s$$
For example, suppose College Station has an average annual temperature of 72 degrees Fahrenheit with a standard deviation of 10 degrees. We want to know what these statistics are in Celsius. The formula for Celsius is:
$$\text{Celsius} = \frac{5}{9}(\text{Fahrenheit} - 32)$$
Written as a + bx, the shift is a = -(5/9)(32) ≈ -17.8 and the scale is b = 5/9, so:
$$\bar{x}_{new} = \frac{5}{9}(72 - 32) \qquad s_{new} = \frac{5}{9}(10)$$
$$\bar{x}_{celsius} \approx 22.2 \qquad s_{celsius} \approx 5.556$$

Transformations-Shifts
Suppose we discover that a measuring instrument was off by 3 inches because someone was measuring from the top of the shoe to the head. Obviously the recorded heights would not be the true heights of the subjects. If we assume every subject’s shoes were the same height of 3 inches, then we can fix the data with the equation:
$$x_{new} = x_i + 3$$
Ordered heights: 62 65 68 68 69 69 70 71 71 72
Shifted heights: 65 68 71 71 72 72 73 74 74 75
Notice what this does to the following statistics.
$$\bar{x}_{new} = \bar{x} + 3 \qquad s^2_{new} = s^2 \qquad \text{range} = 10 \qquad \text{IQR} = 3$$
$$\min x_{new} = \min x + 3 \qquad Q1_{new} = Q1 + 3 \qquad \tilde{x}_{new} = \tilde{x} + 3 \qquad Q3_{new} = Q3 + 3 \qquad \max x_{new} = \max x + 3$$
What we see from this is that a shift adds or subtracts the same amount from every statistic that is not related to spread. The statistics that describe the spread (e.g. $s^2$, range and IQR) are not affected by the shift.

Transformations - Scale
Going back to our original subjects, for whom we have heights in inches, suppose that instead of inches we wanted to know how tall everyone was in cm. 2.54 cm = 1 inch.
Therefore our new data are as follows:
Ordered heights (inches): 62 65 68 68 69 69 70 71 71 72
Heights in cm: 157.48 165.10 172.72 172.72 175.26 175.26 177.80 180.34 180.34 182.88
$$\bar{x}_{new} = 174 \qquad s_{new} = 7.69 \qquad \text{Range}_{new} = 25.4 \qquad \text{IQR}_{new} = 7.62$$
$$\min x_{new} = 157.48 \qquad Q1_{new} = 172.7 \qquad \tilde{x}_{new} = 175.3 \qquad Q3_{new} = 180.3 \qquad \max x_{new} = 182.88$$
Unlike with the shifts, notice that every single one of these statistics is affected by the scale change.
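The effect of shift and scale changes on the summary statistics can be verified with a short sketch. The helper names `quartiles` and `summary` are made up for illustration; the quartile rule is the one described on the quartile slide (median of each half, with the median included in both halves when the count is odd), which is an assumption that differs from some software defaults.

```python
from statistics import mean, median, stdev

# Ordered heights in inches, from the slides.
heights_in = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]


def quartiles(ordered):
    """Q1 and Q3 as medians of the lower and upper halves of sorted data.

    When the count is odd, the median is included in both halves.
    """
    n = len(ordered)
    half = (n + 1) // 2
    return median(ordered[:half]), median(ordered[n - half:])


def summary(data):
    """Mean, sample standard deviation, range and IQR of a data set."""
    ordered = sorted(data)
    q1, q3 = quartiles(ordered)
    return {
        "mean": mean(ordered),
        "sd": stdev(ordered),  # sample standard deviation (divides by n - 1)
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
    }


original = summary(heights_in)
shifted = summary([x + 3 for x in heights_in])    # shift: add 3 inches
scaled = summary([x * 2.54 for x in heights_in])  # scale: inches to cm

for name, stats in [("original", original), ("shift +3", shifted), ("scale x2.54", scaled)]:
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

In the output, the shift changes only the mean, while the scale change multiplies the mean, standard deviation, range and IQR by 2.54.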