Assessing the distribution
EDA technique: the Histogram
an important decision the experimenter should make is what
type of distribution the data have
that is, answering the following questions:
what is the location of data?
perhaps the mean is not a good measure of the
"middle" of the distribution
one can consider something else, like the median:
· data are numerically ordered and the value which is exactly in the middle is
chosen
· e.g. for data 6, 3, 15, 0, 1
· the mean is (6 + 3 + 15 + 0 + 1) / 5 = 5
· the median is 3 (ordered data: 0, 1, 3, 6, 15)
how are data spread? Are they symmetrical or skewed?
are there modes?
are there outliers?
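The mean and median of the small data set above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

data = [6, 3, 15, 0, 1]

# Mean: sum of the values divided by how many values there are.
data_mean = mean(data)

# Median: the middle value once the data are sorted (0, 1, 3, 6, 15).
data_median = median(data)

print("mean:", data_mean)
print("median:", data_median)
```

The single large value (15) pulls the mean above the median, which is why the median can be a better description of the "middle" for data like these.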
the purpose of the Histogram is to graphically show:
1. the centre (the location) of data
2. the spread of data
3. the skewness of data
4. the presence of outliers
5. the presence of multiple modes
the histogram is calculated by splitting data into equal-sized "classes"
this can be done arbitrarily (although there are some
proposed theoretical rules)
for instance, find the minimum and the maximum
values among the data
and split this range into N equal parts
then one counts how many values there are in each class
these counts are plotted as a histogram (a bar chart)
Note: the source for this material is the e-Handbook of Statistical Methods, available at
http://www.itl.nist.gov/div898/handbook/index.htm
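The class-counting procedure described above (find the minimum and maximum, split the range into N equal parts, count the values in each part) could be sketched as follows. The function name `histogram_counts` and the choice to put the maximum value into the last class are assumptions made for this illustration, not part of the lecture.

```python
def histogram_counts(values, n_classes):
    """Count how many values fall into each of n_classes equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    if width == 0:
        # All values are identical: put everything in the first class.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum would land just past the last class, so clamp the index.
        idx = min(int((v - lo) / width), n_classes - 1)
        counts[idx] += 1
    return counts


# One value for each possible number of mistakes, 0 to 15, in 8 classes.
print(histogram_counts(list(range(16)), 8))
```

Each class here includes its lower edge and excludes its upper edge, except the last one, which also includes the maximum.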
COMP106 - lecture 20 – p.1/12
Example
we run the experiment on the menu evaluation with 100 users
we ask each user to find a given menu option
for each user, we count the number of mistakes they make
before they find the right option
we check the data, and we find that the minimum number of
mistakes made by a user is zero
while the maximum number of mistakes is 15
we split the range [0 . . . 15] into, say, N = 8 equal ranges:
[0, 1]
[2, 3]
[4, 5]
[6, 7]
[8, 9]
[10, 11]
[12, 13]
[14, 15]
then we count:
how many users have made 0 or 1 mistake
how many users have made 2 or 3 mistakes
...
how many users have made 14 or 15 mistakes
suppose we have:
[0, 1] = 6
[2, 3] = 20
[4, 5] = 22
[6, 7] = 17
[8, 9] = 24
[10, 11] = 5
[12, 13] = 2
[14, 15] = 4
the histogram then has one bar per range, with height equal to its count
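As a rough illustration, the counts from the menu example can be drawn as a text histogram. The sketch assumes the eight classes are [0, 1], [2, 3], ..., [14, 15]; the helper `text_histogram` is a hypothetical name introduced here.

```python
labels = ["0-1", "2-3", "4-5", "6-7", "8-9", "10-11", "12-13", "14-15"]
counts = [6, 20, 22, 17, 24, 5, 2, 4]


def text_histogram(labels, counts):
    """Return one line per class: the label, a bar of '#' marks, and the count."""
    return [f"{label:>5} | {'#' * count} {count}"
            for label, count in zip(labels, counts)]


for line in text_histogram(labels, counts):
    print(line)

# The counts should account for all 100 users in the experiment.
print("total users:", sum(counts))
```

The tallest bars are the ones for 4-5 and 8-9 mistakes, and the counts fall away quickly after 9 mistakes.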
Analysing Histograms: symmetry
in a symmetric distribution the histogram has two "halves",
one is almost a mirror-image of the other
in a symmetric distribution, one can see the "body" (the center
of the distribution) where most of the data are, and the "tails"
(the extreme regions)
the tail "length" indicates how fast these extremes go to zero
in short-tailed distributions, the extremes approach zero very
quickly (the histogram has a truncated look)
the tail length tells how good the mean is as an estimate of the location
of the distribution:
for moderate tails, it is a good choice
for very short tails, it is a poor choice: better use the midrange
(= (smallest + largest) / 2)
for very long tails it is a horrible choice: better use the median
Analysing Histograms: skews and modes
in symmetrical distributions, the center is a good typical value
in skewed distributions, one tail is considerably longer than
the other
in these cases, it is difficult to estimate the "typical value" of
the distribution
in skewed distributions one should give three values:
1. the mean
2. the median
3. the mode, that is the value that occurs most often
(this is the highest peak in the histogram)
in fact, there could be more than one mode for a distribution
one can have modes for symmetric distributions too
skews can be caused by start-up effects (for instance, a user
can make lots of mistakes when using the menu for the first
few times, and improve with use)
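The location measures mentioned above (mean, median, midrange, mode) can be computed side by side. The following is a minimal sketch; the data set is made up for illustration and `location_summary` is a hypothetical helper name. `statistics.multimode` needs Python 3.8 or later.

```python
from statistics import mean, median, multimode


def location_summary(values):
    """Return the mean, median, midrange and mode(s) of the values."""
    return {
        "mean": mean(values),
        "median": median(values),
        # Midrange: halfway between the smallest and the largest value.
        "midrange": (min(values) + max(values)) / 2,
        # multimode returns every value that occurs most often.
        "modes": multimode(values),
    }


# Made-up, right-skewed data: most values are small, one is very large.
skewed = [0, 0, 1, 1, 1, 2, 2, 3, 5, 15]
summary = location_summary(skewed)
for name, value in summary.items():
    print(name, value)
```

On skewed data like this the four measures disagree noticeably, which is the reason for reporting more than one of them.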
Analysing Histograms: outliers
in a symmetric distribution, an outlier is a point which is far
from the bulk of data
this can be caused by several factors
anomalous input
equipment failures
a change in the settings
etc.
sometimes it is common practice to ignore the outliers, and concentrate
on the main distribution of data
this is generally wrong: outliers can give valuable information,
and ignoring them could be dangerous too
the ozone depletion problem could have been discovered much earlier if the satellite
taking the measurements had not systematically purged the data coming from the
South Pole!
Analysing Histograms: shape
a common shape to observe is the "normal" one, that is a
"bell" like shape
symmetric histogram
most of the frequency counts are in the middle
with the counts dying off out in the tails
in normal distributions, mean, median and mode coincide
Importance of the Normal Distribution
there are many other shapes (exponential, uniform, Cauchy...)
the normal is probably the most important:
many classical statistical tests are based on the assumption
that the data follow a normal distribution
in modeling applications the error term is often assumed to
follow a normal distribution with fixed location and scale
the normal distribution is used to find significance levels
in many hypothesis tests and confidence intervals
this is because a theorem provides a theoretical basis for its
wide applicability:
Central Limit Theorem: as the sample size (i.e. the number of values) increases
1. the distribution of the mean becomes approximately normal regardless of the
distribution of the original variable
2. the distribution of the mean is centered at the population mean of the original
variable
Examples
example histogram shapes:
· normal
· symmetric, non normal, short tailed
· symmetric, non normal, long tailed
· symmetric, two modes
· skewed (non symmetric)
· symmetric with outlier
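The Central Limit Theorem stated above can be explored with a small simulation. This sketch draws samples from an exponential distribution with population mean 1 (a strongly skewed distribution, chosen here only as an example) and looks at how the sample means behave as the sample size grows. The helper names and the sample sizes 2 and 50 are assumptions for the illustration.

```python
import random
from statistics import mean, pstdev


def sample_means(sample_size, n_samples, rng):
    """Draw n_samples samples of the given size and return their means."""
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


def skewness(values):
    """Average cubed standardised deviation: about 0 for a symmetric shape."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
small = sample_means(2, 2000, rng)
large = sample_means(50, 2000, rng)

for name, means in (("n=2", small), ("n=50", large)):
    print(name, "centre:", round(mean(means), 2),
          "skewness:", round(skewness(means), 2))
```

With the larger sample size the means stay centered near the population mean while their distribution becomes less skewed, in line with the two points of the theorem.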
Normal Probability Plot
in order to check whether the distribution is normal, a probability
plot should be used
in a probability plot, the values obtained from the experiment
are ordered
and then plotted against the values of the theoretical
distribution to test
these values can be obtained from tables, or calculated
if the plot forms a straight line, then the data are consistent with
the distribution chosen
departures from this straight line indicate departures from the
specified distribution
in the normal probability plot, data are plotted against a
theoretical normal distribution
Examples
example normal probability plots (ordered data on the vertical axis):
· normal: the points lie close to the straight line
· non normal: data has short tails,
the plot has an S-shape, with the first points above
the straight line and the last ones below the straight
line
· non normal: data has long tails,
the plot has an S-shape, with the first points below
the straight line and the last ones above the straight
line
· non normal: data is skewed right, as all the points
are below the straight line
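The construction of a normal probability plot described above (order the data, pair each value with a value from the theoretical normal distribution, check for a straight line) might be sketched like this. The plotting positions (i - 0.5) / n are an assumption of this sketch; other conventions exist. The correlation coefficient is used here only as a rough numerical stand-in for "how straight the plot is". `statistics.NormalDist` needs Python 3.8 or later.

```python
import random
from statistics import NormalDist, mean


def normal_plot_points(values):
    """Pair each ordered data value with a theoretical standard normal value."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def straightness(points):
    """Pearson correlation of the plotted points (1.0 means a perfect line)."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable
normal_data = [rng.gauss(10, 2) for _ in range(200)]
skewed_data = [rng.expovariate(1.0) for _ in range(200)]

r_normal = straightness(normal_plot_points(normal_data))
r_skewed = straightness(normal_plot_points(skewed_data))
print("normal data:", round(r_normal, 3))
print("skewed data:", round(r_skewed, 3))
```

The points returned by `normal_plot_points` can be passed to any plotting tool; the theoretical values go on the horizontal axis and the ordered data on the vertical axis.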