1.2 Describing Distributions with Numbers
Center and spread are the most basic
descriptions of what a data set “looks like.”
They are intuitively meant to measure
exactly what comes to your mind when
you hear those terms, but the best way to
define them isn’t so obvious.
We first investigate center.
Center
What is the center?
Good question!
The most popular measure of center is the
mean or average value of a data set.
The mean of a data set is computed by
summing all data values and then dividing
by n. We denote the mean by x̄ (read “x-bar”).
The Greek letter Σ (Sigma) is used as
shorthand to mean “sum up.”
Notation
Suppose we are
given the data values
10 4 7 12 1 4 10 8.
Compute Σx.
We may also combine
operations with x and
the summation.
Compute Σ(x+1).
A nice formula for the
mean may then be
written as
\[ \text{mean} = \frac{\sum x}{n} \]
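As a quick check of the notation, the following Python sketch works through the two sums and the mean for the data values listed above. The variable names are illustrative only and are not part of the original slides.

```python
# Data values from the notation example above.
x = [10, 4, 7, 12, 1, 4, 10, 8]
n = len(x)

sum_x = sum(x)                        # Σx: add up all data values
sum_x_plus_1 = sum(v + 1 for v in x)  # Σ(x+1): add 1 to each value, then sum
mean = sum_x / n                      # Σx divided by n

print(sum_x, sum_x_plus_1, mean)  # 56 64 7.0
```

Note that Σ(x+1) exceeds Σx by exactly n, since 1 is added once for each of the n values.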
Problems with mean
Consider the data set 3 2 6 3 5 70 2 4 4 1.
Compute the mean of the data.
We have that x̄ = 10, which hardly seems like a
good “measure of average.”
The problem can be attributed to the value 70.
A value which is considerably larger (or smaller)
than the rest of the data pattern is known as an
outlier.
Any measure of central tendency that is
“sensitive” to extreme values, such as above, is
not a resistant measure.
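To see how much a single outlier moves the mean, here is a small Python sketch that computes the mean of the data set above with and without the value 70. Dropping the value is done here only for illustration; it is not a recommendation to discard outliers.

```python
data = [3, 2, 6, 3, 5, 70, 2, 4, 4, 1]
without_outlier = [v for v in data if v != 70]

mean_all = sum(data) / len(data)                             # 100 / 10
mean_trimmed = sum(without_outlier) / len(without_outlier)   # 30 / 9

print(mean_all, round(mean_trimmed, 2))  # 10.0 3.33
```

One value out of ten triples the mean, which is what it means for the mean not to be resistant.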
A Resistant Measure
The median for a collection of data values
is the number that is exactly in the middle
position of the list when the data are
arranged in increasing order. When the number
of values is even, the median is the average of
the two middle values. We use M to denote the
median.
Let’s find the median in the example above.
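A minimal Python sketch of the procedure: sort the data, then take the middle position. It assumes the usual convention of averaging the two middle values when the count is even. The function name `median_of` is made up for this example.

```python
def median_of(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [3, 2, 6, 3, 5, 70, 2, 4, 4, 1]
M = median_of(data)

# Replace the outlier 70 with 7: the median does not move.
M_changed = median_of([7 if v == 70 else v for v in data])

print(M, M_changed)  # 3.5 3.5
```

The sorted data are 1 2 2 3 3 4 4 5 6 70, so the two middle values are 3 and 4, giving M = 3.5 whether the largest value is 70 or 7.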
Comparison
The median, unlike the mean, is a resistant
measure of central tendency, which is a
strong advantage. However, the mean takes
the size of every data value into account,
while the median depends only on the
ordering of the values.
There is a formula for the mean but only a
location for the median.
Inadequacies with Centers
What centers do not take into account is
the spread of the data. For example,
consider the following data sets: A={19 20
20 21} and B={1 20 20 39}.
In both data sets, the mean and median
are both 20, but the data of B is much
more spread out than the data of A.
Thus, using just a center to describe our
data is not good enough. We also need
spread.
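The claim about A and B can be checked directly. This sketch uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

A = [19, 20, 20, 21]
B = [1, 20, 20, 39]

# Both centers agree for both data sets, even though B is far more spread out.
centers_A = (mean(A), median(A))
centers_B = (mean(B), median(B))

print(centers_A == centers_B)  # True
```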
Measuring spread
The most obvious measure of spread is the
range of a set of data; that is, the difference
between the highest and lowest data values.
The range of a data set is denoted R.
Above, R(A)=21-19=2 and R(B)=39-1=38.
But consider the data sets C={1 50 50 50 50 50
50 50 50 50 99} and D={1 10 20 30 40 50 60 70
80 90 99}. Not only are the mean and median the
same in both data sets, but so is the range, even
though the values of D are far more spread out
than those of C.
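A short Python sketch confirming the numbers for all four data sets. The helper `data_range` is a name chosen for this example, not a standard library function.

```python
from statistics import mean, median

def data_range(values):
    """Range R: highest data value minus lowest data value."""
    return max(values) - min(values)

A = [19, 20, 20, 21]
B = [1, 20, 20, 39]
C = [1, 50, 50, 50, 50, 50, 50, 50, 50, 50, 99]
D = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99]

print(data_range(A), data_range(B))               # 2 38
print(mean(C), median(C), data_range(C))          # 50 50 98
print(mean(D), median(D), data_range(D))          # 50 50 98
```

The range separates A from B, but it cannot separate C from D, because it looks only at the two extreme values.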
Percentile
The mth percentile is the number that
separates the bottom m% of the data from
the top (100-m)% of the data. It is
denoted by Pm.
Note that the median of a data set is the
50th percentile so that P50=median.
Quartiles and IQR
We define the first, second, and third quartile by
Q1=P25, Q2=P50, and Q3=P75.
The interquartile range, denoted IQR, is defined
by IQR=Q3-Q1.
It is easy to convince yourself that IQR is a
resistant measure!
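One way to convince yourself is to compute the IQR before and after making an outlier even more extreme. Textbooks and software differ on exactly how quartiles are computed; the sketch below assumes the common "median of each half" convention (Q1 is the median of the lower half of the sorted data, Q3 the median of the upper half), which the slides do not spell out. The function names are illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2 :]  # values above the median position
    return median(lower), median(upper)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

data = [3, 2, 6, 3, 5, 70, 2, 4, 4, 1]
more_extreme = [7000 if v == 70 else v for v in data]

print(quartiles(data), iqr(data))  # (2, 5) 3
print(iqr(more_extreme))           # 3
```

Changing the largest value from 70 to 7000 leaves the IQR untouched, because the quartiles depend only on the middle portion of the sorted data.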
The 5 Number Summary
We study a way to
present a data set in
which the reader can
easily read off the
quartiles and the high
and low values of the
data set.
The following is an
example of a boxplot.
Min, Q1, M, Q3, Max
E.g. Consider the data set {33, 36, 37, 37, 38, 41, 42, 42,
42, 45, 47, 52, 54, 55, 56, 56, 57, 60, 78, 92}. Construct
a boxplot.
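Before drawing the boxplot, it helps to compute the five numbers it displays. The sketch below again assumes the median-of-halves convention for Q1 and Q3; other quartile conventions can give slightly different values. The function name is made up for this example.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max), with quartiles taken as medians of each half."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2 :])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

data = [33, 36, 37, 37, 38, 41, 42, 42, 42, 45,
        47, 52, 54, 55, 56, 56, 57, 60, 78, 92]

print(five_number_summary(data))  # (33, 39.5, 46.0, 56.0, 92)
```

The box then runs from Q1 = 39.5 to Q3 = 56 with a line at M = 46, and the whiskers extend to 33 and 92.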
To identify outliers, we use a modified boxplot. The idea
is that instead of drawing the whiskers from Q1 to the
lowest value and from Q3 to the highest value, we draw the
upper whisker from Q3 to the largest data value between
Q3 and Q3 + 1.5×IQR. The lower whisker is drawn from
Q1 to the smallest data value between Q1 − 1.5×IQR and
Q1. Any data value that lies beyond the whiskers is
considered an outlier and is plotted individually.
Construct a modified boxplot for the above data.
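The whisker ends and outliers for the modified boxplot can be worked out numerically. This is a sketch under the same median-of-halves assumption for the quartiles; the names `lower_fence` and `upper_fence` are illustrative labels for Q1 − 1.5×IQR and Q3 + 1.5×IQR.

```python
from statistics import median

data = [33, 36, 37, 37, 38, 41, 42, 42, 42, 45,
        47, 52, 54, 55, 56, 56, 57, 60, 78, 92]

ordered = sorted(data)
n = len(ordered)
q1 = median(ordered[: n // 2])        # 39.5
q3 = median(ordered[(n + 1) // 2 :])  # 56.0
iqr = q3 - q1                         # 16.5

lower_fence = q1 - 1.5 * iqr          # 14.75
upper_fence = q3 + 1.5 * iqr          # 80.75

# Whiskers stop at the most extreme data values that are still inside the fences.
inside = [v for v in ordered if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]

# Anything outside the fences is plotted individually as an outlier.
outliers = [v for v in ordered if v < lower_fence or v > upper_fence]

print(whisker_low, whisker_high, outliers)  # 33 78 [92]
```

Under these assumptions the upper whisker stops at 78 and the value 92 is plotted on its own as an outlier.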
Another measure of spread
There are other things we can do. Let’s experiment
with the data set 2 3 7 12 on the board.
We have just “discovered” the standard deviation.
The standard deviation is denoted by s, and the
variance is denoted by s². They are defined by
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
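The board experiment with the data set 2 3 7 12 can be retraced step by step: find the mean, square each deviation from the mean, sum the squares, divide by n − 1, and take the square root. The Python sketch below follows those steps; the variable names are illustrative.

```python
import math

x = [2, 3, 7, 12]
n = len(x)

x_bar = sum(x) / n                            # mean: 24 / 4 = 6.0
squared_devs = [(v - x_bar) ** 2 for v in x]  # [16.0, 9.0, 1.0, 36.0]
variance = sum(squared_devs) / (n - 1)        # s squared: 62 / 3
s = math.sqrt(variance)                       # standard deviation

print(squared_devs, round(variance, 3), round(s, 3))
# [16.0, 9.0, 1.0, 36.0] 20.667 4.546
```

Python's standard library function `statistics.stdev` uses the same n − 1 divisor and should agree with `s` here.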
Problems and problems
Compute the standard deviation for the
data sets A and B.
“[The standard deviation]… will be large if
the observations are widely spread about
their mean, and small if the observations
are close to their mean.”
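For the data sets A and B from earlier, the quoted behavior shows up clearly. This sketch uses `statistics.stdev`, which computes the sample standard deviation with the n − 1 divisor.

```python
from statistics import stdev

A = [19, 20, 20, 21]
B = [1, 20, 20, 39]

s_A = stdev(A)  # sqrt(2 / 3)
s_B = stdev(B)  # sqrt(722 / 3)

print(round(s_A, 2), round(s_B, 2))  # 0.82 15.51
```

Both data sets have mean 20, but the observations in B sit much farther from it, so s is roughly 19 times larger for B than for A.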
Unfortunately, the standard deviation and the
variance, like the range, are not resistant;
that is, they are sensitive to outliers.
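To put numbers on this sensitivity, the sketch below reuses the earlier data set containing the outlier 70 and compares the range and standard deviation with and without that value. Removing the value is purely for comparison.

```python
from statistics import stdev

data = [3, 2, 6, 3, 5, 70, 2, 4, 4, 1]
without_outlier = [v for v in data if v != 70]

range_all = max(data) - min(data)                                # 69
range_trimmed = max(without_outlier) - min(without_outlier)      # 5

s_all = stdev(data)                 # sqrt(4020 / 9), about 21.13
s_trimmed = stdev(without_outlier)  # sqrt(20 / 8), about 1.58

print(range_all, range_trimmed)
print(round(s_all, 2), round(s_trimmed, 2))
```

A single extreme value inflates both measures by more than a factor of ten.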