Assessing the distribution

an important decision the experimenter should make is what type of distribution the data have
that is, answering the following questions:
what is the location of the data?
  · perhaps the mean is not a good measure of the "middle" of the distribution
  · one can consider something else, like the median: the data are numerically ordered and the value which is exactly in the middle is chosen
  · e.g. for data 6, 3, 15, 0, 1
  · the mean is (6+3+15+0+1)/5 = 5
  · the median is 3 (0, 1, 3, 6, 15)
how are the data spread?
are they symmetrical or skewed?
are there modes?
are there outliers?

COMP106 - lecture 20 – p.1/12

EDA technique: the Histogram

the purpose of the Histogram is to graphically show:
1. the centre (the location) of the data
2. the spread of the data
3. the skewness of the data
4. the presence of outliers
5. the presence of multiple modes
it is calculated by splitting the data into equal sized "classes"
this can be done arbitrarily (although there are some proposed theoretical rules)
for instance, find the minimum and the maximum values among the data and split this range into N equal parts
then one counts how many values there are in each class
these counts are plotted as a histogram (a bar chart)
Note: source for this material is the e-Handbook of statistical methods, available at http://www.itl.nist.gov/div898/handbook/index.htm

Example

we run the experiment on the menu evaluation with 100 users
we ask each user to find a given menu option
for each user, we count the number of mistakes they make before they find the right option
we check the data, and we find that the minimum number of mistakes made by a user is zero while the maximum number of mistakes is 15
we split the range [0 . . . 15] into, say, N = 8 equal ranges:
  [0, 1]   [2, 3]   [4, 5]   [6, 7]   [8, 9]   [10, 11]   [12, 13]   [14, 15]
then we count:
  · how many users have made 0 or 1 mistakes
  · how many users have made 2 or 3 mistakes
  · ...
  · how many users have made 14 or 15 mistakes
suppose we have:
  [0, 1] = 6
  [2, 3] = 20
  [4, 5] = 22
  [6, 7] = 17
  [8, 9] = 24
  [10, 11] = 5
  [12, 13] = 2
  [14, 15] = 4
the histogram will be
  [figure: histogram of the eight class counts above]

Analysing Histograms: symmetry

in a symmetric distribution the histogram has two "halves", one is almost a mirror-image of the other
in a symmetric distribution, one can see the "body" (the center of the distribution) where most of the data are, and the "tails" (the extreme regions)
the tail "length" indicates how fast these extremes go to zero
in short-tailed distributions, the extremes approach zero very quickly (the histogram has a truncated look)
the tail length tells how good the mean is as an estimate of the location of the distribution:
  · for moderate tails, it is a good choice
  · for very short tails, it is a poor choice: better use the midrange (= (smallest + largest) / 2)
  · for very long tails, it is a horrible choice: better use the median

Analysing Histograms: skews and modes

in skewed distributions, one tail is considerably longer than the other
in these cases, it is difficult to estimate the "typical value" of the distribution
in symmetrical distributions, the center is a good typical value
in skewed distributions one should give three values:
1. the mean
2. the median
3. the mode, that is the value that occurs most often
the mode is the highest peak in the histogram
in fact, there could be more than one mode for a distribution
one can have modes for symmetric distributions too
skews can be caused by start-up effects (for instance, a user can make lots of mistakes when using the menu for the first few times, and improve with use)

Analysing Histograms: outliers

in a symmetric distribution, an outlier is a point which is far from the bulk of the data
this can be caused by several factors:
  · anomalous input
  · equipment failures
  · a change in the settings
  · etc.
sometimes it is practice to ignore the outliers, and concentrate on the main distribution of the data
this is generally wrong: outliers can give valuable information, and ignoring them could be dangerous too
the ozone depletion problem could have been discovered much earlier if the satellite detecting the measurements had not systematically purged the data coming from the South Pole!

Analysing Histograms: shape

a common shape to observe is the "normal" one, that is a "bell" like shape:
  · symmetric histogram
  · most of the frequency counts are in the middle
  · with the counts dying off out in the tails
in normal distributions, mean, median and mode coincide

Importance of the Normal Distribution

there are many other shapes (exponential, uniform, Cauchy...)
the normal is probably the most important:
  · many classical statistical tests are based on the assumption that the data follow a normal distribution
  · in modeling applications the error term is often assumed to follow a normal distribution with fixed location and scale
  · the normal distribution is used to find significance levels in many hypothesis tests and confidence intervals
this is because a theorem provides a theoretical basis for its wide applicability:
Central Limit Theorem: as the sample size (i.e. the number of values) increases
1. the distribution of the mean becomes approximately normal regardless of the distribution of the original variable
2. the distribution of the mean is centered at the population mean of the original variable

Examples

[figures: example histograms]
  · normal
  · symmetric, non normal, short tailed
  · symmetric, non normal, long tailed
  · symmetric, two modes
  · skewed (non symmetric)
  · symmetric with outlier

Normal Probability Plot

in order to be sure that the distribution is normal, a probability plot should be used
in a probability plot, the values obtained from the experiment are ordered and then plotted against the values of the theoretical distribution to test
these theoretical values can be obtained from tables, or calculated
if the plot forms a straight line, then the distribution chosen is in fact the correct one
departures from this straight line indicate departures from the specified distribution
in the normal probability plot, data are plotted against a theoretical normal distribution

Examples

[figures: example normal probability plots]
  · normal: the points lie close to the straight line
  · non normal: data has short tails, the plot has an S-shape, with the first points above the straight line and the last ones below the straight line
  · non normal: data has long tails, the plot has an S-shape, with the first points below the straight line and the last ones above the straight line
  · non normal: data is skewed right, as all the points are below the straight line
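
The calculations described in this lecture (location measures, class counts for a histogram, and the points of a normal probability plot) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not code from the lecture or from the e-Handbook: the function names are hypothetical, the raw `mistakes` list is invented so that it reproduces the class counts of the menu example (the lecture gives only the counts, not the individual values), and the plotting position (i - 0.5) / n is one common convention among several.

```python
from collections import Counter
from statistics import NormalDist, mean, median, multimode


def location_summary(data):
    """Return the location measures discussed in the lecture."""
    return {
        "mean": mean(data),
        "median": median(data),
        "midrange": (min(data) + max(data)) / 2,
        "modes": multimode(data),  # every value tied for most frequent
    }


def histogram_counts(data, lo, hi, n_classes):
    """Count integer data in n_classes equal-sized classes covering [lo, hi].

    Assumes integer data and a range that splits evenly into n_classes.
    Returns a list of ((class_low, class_high), count) pairs.
    """
    width, remainder = divmod(hi - lo + 1, n_classes)
    if remainder:
        raise ValueError("range does not split into equal integer classes")
    if any(x < lo or x > hi for x in data):
        raise ValueError("data value outside [lo, hi]")
    counts = Counter((x - lo) // width for x in data)
    return [
        ((lo + k * width, lo + (k + 1) * width - 1), counts.get(k, 0))
        for k in range(n_classes)
    ]


def normal_probability_points(data):
    """Pair each ordered value with a theoretical standard normal quantile.

    The plotting position (i - 0.5) / n is an assumed convention.
    """
    n = len(data)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(sorted(data), start=1)
    ]


# Class counts from the menu example: lower bound of class -> number of users.
example_counts = {0: 6, 2: 20, 4: 22, 6: 17, 8: 24, 10: 5, 12: 2, 14: 4}
# Hypothetical raw data: every user is placed at the lower bound of their class.
mistakes = [value for value, n in example_counts.items() for _ in range(n)]

print(location_summary([6, 3, 15, 0, 1]))
for (low, high), count in histogram_counts(mistakes, 0, 15, 8):
    print(f"[{low:2d}, {high:2d}] {'#' * count} {count}")
```

If the points returned by `normal_probability_points` fall roughly on a straight line when plotted (theoretical quantile on one axis, observed value on the other), the data are consistent with a normal distribution. The S-shaped and curved departures listed in the examples above show up as systematic deviations from that line.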